Microsoft's Production
Configurable Cloud
Mark Russinovich
Chief Technology Officer, Microsoft Azure
Context in 2010
• Moore’s Law was fine: more cores, but single-thread performance gains were slowing
• No real focus on datacenter accelerators
• FPGAs still strong in their traditional markets, but a non-consensus bet for compute
• But there were storm clouds on the horizon
• >90% of the Fortune 500 are on the Microsoft Cloud
• >80% of the world’s largest banks are Azure customers
• >75% of G-SIFIs have signed enterprise agreements with Azure
• Compliance: FedRAMP High, DISA IL-4, ITAR, CJIS
Regions
• Americas: Central US (Iowa), North Central US (Illinois), South Central US (Texas), East US (Virginia), East US 2 (Virginia), West US (California), US Gov Virginia, US Gov Iowa, US DoD East (Virginia), US DoD West (Iowa), Canada Central (Toronto), Canada East (Quebec City), Brazil South (Sao Paulo State)
• Europe: North Europe (Ireland), West Europe (Netherlands), United Kingdom, Germany Central (Frankfurt), Germany North East (Magdeburg), France (Paris, Marseille)
• Asia-Pacific: East Asia (Hong Kong), SE Asia (Singapore), China North (Beijing), China South (Shanghai), Japan East (Tokyo, Saitama), Japan West (Osaka), Korea (Seoul, Busan), India Central (Pune), India South (Chennai), India West (Mumbai), Australia East (New South Wales), Australia South East (Victoria)
(The original map marks some of these regions as new.)
What Drives a Post-CPU “Enhanced” Cloud?
[Quadrant diagram: training vs. inference workloads, on the client and in the cloud, handled today by humans, ASICs, and GPUs — with one quadrant still an open question mark]
The tension: efficiency (ASICs) vs. homogeneity
What is FPGA Technology?
• Field Programmable Gate Array
• Programmable hardware
• Can be rewritten with a new image (bitstream) in seconds, soon in hundreds of milliseconds
• Chip has large quantities of programmable units
• Network, memories, and logic (LUTs)
• Program specialized circuits that communicate
directly
• Stored as bit tables rather than polygons of materials
• Can build functional units, state machines, networking
circuits, etc.
• Must be programmed in the same languages (e.g., Verilog) used to design ASIC chips
• FPGA chips are now large SoCs
• Thousands of hardened DSP blocks, DRAM controllers,
PCIe controllers, and now ARM cores
• Now: growing process gap between ASICs and “big
iron” (CPU, FPGA, GPU)
[Die diagram: generic logic (LUTs) surrounded by specialized blocks — DSP multiplier blocks, 20 Kb dual-port RAMs, and specialized I/O blocks for PCIe and the network. A toy LUT sketch follows below.]
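To make “stored as bit tables” concrete, here is a toy sketch (mine, not from the deck) of a 4-input LUT in C++: the entire “program” of a logic element is a 16-entry truth table, and a bitstream is essentially millions of such tables plus routing configuration.

```cpp
// A k-input LUT is a 2^k-entry truth table: any 4-input boolean
// function fits in one 16-bit constant. Reprogramming an FPGA swaps
// these tables (plus routing state), which is what a bitstream encodes.
#include <cstdint>

struct Lut4 {
    uint16_t table;  // 16 configuration bits = the function's truth table
    constexpr bool eval(bool a, bool b, bool c, bool d) const {
        const unsigned idx = (unsigned(a) << 3) | (unsigned(b) << 2) |
                             (unsigned(c) << 1) | unsigned(d);
        return (table >> idx) & 1u;
    }
};

// Configure the LUT as a 4-input AND: only row 0b1111 outputs 1.
constexpr Lut4 and4{uint16_t{1} << 15};
static_assert(and4.eval(true, true, true, true), "AND of all-ones is 1");
static_assert(!and4.eval(true, true, true, false), "any 0 input gives 0");
```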
First try: v0
• Use commodity SuperMicro servers
• 6 Xilinx LX240T FPGAs
• One appliance per rack
• All rack machines communicate over 1Gb Ethernet
Appliance:
• 1U rack-mounted
• 2 x 10GbE ports
• 3 x16 PCIe slots
• 12 Intel Westmere cores (2 sockets)
Why it never reached production:
• Additional single point of failure
• Additional SKU to maintain
• Too much load on the network
• Inelastic FPGA scaling or stranded capacity
Second try: V1
• Altera Stratix V D5
• 172.6K ALMs, 2,014 M20Ks
• 457K LEs
• 1 KLE == ~12K gates
• An M20K is a 2.5 KB SRAM block
• PCIe Gen3 x8, 8GB DDR3
• 20 Gb network among FPGAs
[Board diagram: Stratix V with 8GB DDR3 and PCIe Gen3 x8]
Mapped Fabric into a Pod
• Low-latency access to a local FPGA
• Compose multiple FPGAs to accelerate large workloads
• Low-latency, high-bandwidth sharing of storage and
memory across server boundaries
[Pod diagram: 48 servers per pod behind a Top-of-Rack switch (ToR), one FPGA per server; the FPGAs are linked by dedicated 10 Gb Ethernet into a 6x8 torus, separate from the ToR network]
Built Three Programmable Engines for Bing
[Engine diagrams: FE — a stream-preprocessing FSM distributing control/data tokens to per-feature state machines; FFE — multi-threaded soft cores, each with registers, constants, a local ALU, DSP blocks, and a complex ALU (ln, divide), plus compression thresholds and per-instance instruction stores; DTS — decision-tree tiles fed over a feature transmission network]
• FE0: 89 non-BodyBlock features, 34 state machines, 55% utilization
• FE1: 55 BodyBlock features, 20 state machines, 45% utilization
• FFE: 64 cores/chip (FFE[0..1][0..3]), 256-512 threads
• DTT: 48 tiles/chip (DTT[0..3][0..11]), 240 tree processors, 2,880 trees/chip
1,632 server pilot deployed in BN2
Why it never reached production:
• Microsoft was converging on a single server SKU
• No one else wanted the secondary network
• Complex, difficult to handle failures
• Difficult to service boxes
• No killer infrastructure accelerator
• The application footprint was too small
Hyperscale SDN:
Building the Right Abstractions
[Diagram: a proprietary appliance bundles the management, control, and data planes in one box; Azure disaggregates them into Azure Resource Manager (management plane), the SDN controller (control plane), and the host switch (data plane)]
• Management plane: create a tenant
• Control plane: plumb tenant ACLs to switches
• Data plane: apply ACLs to flows
(A toy sketch of this separation follows below.)
Key to flexibility and scale is Host SDN
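A minimal sketch of the plane separation above, with every type and function name invented for illustration: the management plane records tenant intent, the control plane compiles it into per-switch state, and the data plane evaluates individual flows against that state.

```cpp
// Toy three-plane split (all names illustrative, not Azure's code).
#include <set>
#include <string>
#include <vector>

struct Acl { std::string tenant, rule; };

struct HostSwitch {                 // data plane: apply ACLs to flows
    std::vector<Acl> acls;
    bool allow(const std::string& tenant, const std::string& flow) const {
        for (const auto& a : acls)
            if (a.tenant == tenant && a.rule == flow) return true;
        return false;
    }
};

struct Controller {                 // control plane: plumb tenant ACLs to switches
    void plumb(HostSwitch& sw, const std::string& tenant,
               const std::vector<std::string>& rules) {
        for (const auto& r : rules) sw.acls.push_back({tenant, r});
    }
};

struct ResourceManager {            // management plane: create a tenant
    std::set<std::string> tenants;
    void create_tenant(const std::string& name) { tenants.insert(name); }
};
```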
Virtual Filtering Platform (VFP)
• Acts as a virtual switch inside Hyper-V VMSwitch
• Provides core SDN functionality for Azure
networking services, including:
• Address Virtualization for VNET
• VIP -> DIP Translation for SLB
• ACLs, Metering, and Security Guards
• Uses programmable rule/flow tables to perform
per-packet actions
• Supports all Azure data plane policy at 40GbE+
with offloads
• Coming to private cloud in Windows Server 2016
[Diagram: VFP runs inside the Hyper-V VMSwitch between the physical NIC and the VMs’ vNICs, applying layered policy per packet: ACLs/metering/security, VNET, and SLB (NAT)]
Flow Tables: the Right Abstraction for the Host
• VMSwitch exposes a typed Match-Action-Table API to the controller
• Controllers define policy
• One table per policy
• Key insight: let the controller tell the switch exactly what to do with which packets
• e.g., encap/decap as explicit actions, rather than trying to reuse existing abstractions (tunnels, …)
Example policy for VM1 (10.1.1.2) on host 10.4.1.5 — the controller compiles the tenant description (VNet description, VNet routing policy, NAT endpoints, ACLs) into one table per policy (a code sketch follows below):

VNET:
• TO: 10.2/16 → Encap to GW
• TO: 10.1.1.5 → Encap to 10.5.1.7
• TO: !10/8 → NAT out of VNET

LB NAT:
• TO: 79.3.1.2 → DNAT to 10.1.1.2
• TO: !10/8 → SNAT to 79.3.1.2

ACLs:
• TO: 10.1.1/24 → Allow
• TO: 10.4/16 → Block
• TO: !10/8 → Allow
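A minimal C++ sketch of the typed match-action-table model described above — one table per policy, rules matched in order, each matched rule applying a controller-supplied action. The types and matching logic are illustrative assumptions, not VFP’s actual implementation.

```cpp
// Sketch of a layered match-action pipeline (illustrative only):
// a packet traverses every policy table; each matched rule rewrites
// or filters it with an action the controller installed.
#include <cstdint>
#include <functional>
#include <string>
#include <vector>

struct Packet {
    uint32_t dst_ip;    // IPv4 destination, host byte order
    uint16_t dst_port;
};

struct Rule {
    uint32_t prefix, mask;                // destination-prefix match
    bool negate;                          // models "TO: !10/8"
    std::function<void(Packet&)> action;  // encap, NAT, allow/block, ...
    bool matches(const Packet& p) const {
        const bool hit = (p.dst_ip & mask) == (prefix & mask);
        return negate ? !hit : hit;
    }
};

struct Table {                 // one match-action table per policy
    std::string name;          // "VNET", "LB NAT", "ACLs"
    std::vector<Rule> rules;   // evaluated in order, first match wins
    const Rule* lookup(const Packet& p) const {
        for (const auto& r : rules)
            if (r.matches(p)) return &r;
        return nullptr;
    }
};

// The controller installs the layers; the switch just executes them.
void process(Packet& p, const std::vector<Table>& layers) {
    for (const auto& t : layers)
        if (const Rule* r = t.lookup(p)) r->action(p);
}
```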
Host SDN Scale Challenges
• Hosts are scaling up: 1G → 10G → 40G → 50G → 100G
• Reduces COGS of VMs (more VMs per host) and enables new workloads
• Need the performance of hardware to implement policy without CPU
• Need to support new scenarios: BYO IP, BYO Topology, BYO Appliance
• We are always pushing richer semantics to virtual networks
• Need the programmability of software to be agile and future-proof
How do we get the performance of hardware
with programmability of software?
Azure SmartNIC
• Use an FPGA for reconfigurable functions
• FPGAs are already used in Bing (Catapult)
• Roll out Hardware as we do software
• Programmed using Generic Flow Tables
(GFT)
• Language for programming SDN to hardware
• Uses connections and structured actions as
primitives
• SmartNIC can also do Crypto, QoS, storage
acceleration, and more…
[Diagram: the SmartNIC pairs the NIC ASIC with an FPGA, sitting between the host CPU and the ToR]
[Diagram: the first packet of a flow is processed through VFP in the VMSwitch; the resulting entry — e.g., “1.2.3.1->1.3.4.1, 62362->80: Decap, DNAT, Rewrite, Meter” — is pushed through the GFT offload API (NDIS) into the GFT table on the 50G SmartNIC, whose GFT offload engine also offers crypto, QoS, and RDMA]
[Diagram: SmartNIC transposition engine — controllers program per-layer rule/action tables (SLB decap: Decap*, SLB NAT: DNAT*, VNET: Rewrite*/Encap, ACL: Allow*, Metering: Meter*), which compose into a single per-flow transposition (sketched below)]
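A hedged sketch of the GFT first-packet path: software runs the packet through every VFP layer once, folds the layers’ effects into one composite transposition, and programs it into the NIC so later packets of the flow bypass the CPU. FlowKey, Transposition, run_vfp_layers, and offload_to_nic are all hypothetical names.

```cpp
// GFT first-packet slow path (illustrative names throughout).
#include <cstddef>
#include <cstdint>
#include <unordered_map>

struct FlowKey {                          // 5-tuple identifying a connection
    uint32_t src_ip, dst_ip;
    uint16_t src_port, dst_port;
    uint8_t  proto;
    bool operator==(const FlowKey& o) const {
        return src_ip == o.src_ip && dst_ip == o.dst_ip &&
               src_port == o.src_port && dst_port == o.dst_port &&
               proto == o.proto;
    }
};
struct FlowKeyHash {
    std::size_t operator()(const FlowKey& k) const {
        return (std::size_t(k.src_ip) * 31 + k.dst_ip) * 31 +
               ((std::size_t(k.src_port) << 16) | k.dst_port) + k.proto;
    }
};

struct Transposition {                    // composite per-flow action
    bool     decap = false;
    uint32_t new_dst_ip = 0;              // e.g. DNAT 1.2.3.1 -> 1.3.4.1
    uint16_t new_dst_port = 0;            // e.g. 62362 -> 80
    uint32_t meter_id = 0;
};

// Slow path: run the first packet through every VFP layer and fold
// the layers' effects into one composite action (values mirror the
// slide's example flow).
Transposition run_vfp_layers(const FlowKey&) {
    return {true, 0x01030401 /*1.3.4.1*/, 80, 1};
}

// Stand-in for the driver call that programs the hardware GFT table
// (the real offload API is NDIS-based, per the slide).
void offload_to_nic(const FlowKey&, const Transposition&) {}

std::unordered_map<FlowKey, Transposition, FlowKeyHash> gft_table;

// First packet: compute, cache, and offload; subsequent packets of
// the flow are rewritten entirely in the SmartNIC.
const Transposition& handle_first_packet(const FlowKey& key) {
    auto [it, inserted] = gft_table.emplace(key, run_vfp_layers(key));
    if (inserted) offload_to_nic(key, it->second);
    return it->second;
}
```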
Catapult V2: This one works
[Board diagrams: WCS 2.0 server blade — two QPI-connected CPUs with DRAM; the Catapult v2 mezzanine card (“Pikes Peak”) carries the FPGA and its own DRAM, attached to the CPUs by PCIe Gen3 2x8 and to the NIC by PCIe Gen3 x8; the FPGA sits between the NIC and the ToR as a bump-in-the-wire with 40Gb/s QSFP ports. Photo: WCS Gen4.1 blade with Mellanox NIC and Catapult FPGA, mounted on the WCS tray backplane via option-card mezzanine connectors]
• The architecture justifies the economics
1. Can act as a local compute accelerator
2. Can act as a network/storage accelerator
3. Can act as a remote compute accelerator
Configurable Cloud
CPU compute layer
Reconfigurable
compute layer
Converged network
Local acceleration
Production Results (December 2015)
[Chart: 99.9th-percentile query latency vs. queries/sec, software vs. FPGA ranking — curves for average software load, 99.9% software latency, 99.9% FPGA latency, and average FPGA query load]
Infrastructure acceleration
Azure Accelerated Networking: Fastest Cloud Network!
• Highest bandwidth VMs of any cloud
• DS15v2 & D15v2 VMs get up to 25Gbps with <25μs latency
• Consistent low latency network performance
• Provides SR-IOV to the VM
• 10x latency improvement
• Increased packets per second (PPS)
• Reduced jitter means more consistency in workloads
• Enables workloads requiring native performance to run in cloud VMs
• >2x improvement for many DB and OLTP applications
Accelerated Networking Internals
Without acceleration, SDN/networking policy is applied in software in the host; with acceleration, the FPGA applies all policies.
Remote acceleration
Azure Data Center Network Fabrics with 40G NICs
[Fabric diagram: racks of servers under T0 switches (T0-1…T0-20 per rack), aggregated by T1 switches (T1-1…T1-8), a row spine (T2-1-1…T2-1-8), the data center spine (T2-4-1…T2-4-4), and a regional spine (T3-1…T3-4)]
>10 years of experience, with major revisions every six months
Scale-out, active-active
Up to 128 switches wide!
Architecture of a Configurable Cloud
[Diagram: two racks, each with a ToR and servers pairing an FPGA with the NIC (the FPGA inline between NIC and ToR); the racks connect upward through switches labeled CS0-CS3 and SP0-SP3 across the L0 and L1/L2 network tiers]
• FPGAs can encapsulate their own UDP packets
• Low-latency inter-FPGA communication
• Can provide strong network primitives
• Reliable transport
• Smart transport
• But this topology opens up other opportunities
Lightweight Transport Layer (LTL)
[Block diagram: an elastic router (multi-VC on-chip router) exchanges header/data/credit flits over virtual channels with the send and receive paths. Send side: connection lookup, send frame queue, transmit state machine, send connection table, packetizer and transmit buffer, and an unack’d frame store. Receive side: depacketizer, receive state machine, receive connection table, ack generation/receiver, and credit management. Ethernet encap/decap connects both to a 40G MAC+PHY on the datacenter network. Solid links show data flow, dotted links show ACK flow. A send-side sketch follows below.]
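A send-side sketch suggested by the block diagram (unack’d frame store, ack receiver, transmit state machine), assuming cumulative ACKs and timeout-driven retransmission; the deck does not specify LTL’s real framing, retransmit policy, or credit scheme, so all of that is illustrative.

```cpp
// Send side of an LTL-like reliable transport (illustrative sketch).
#include <chrono>
#include <cstdint>
#include <deque>
#include <vector>

using Clock = std::chrono::steady_clock;

struct Frame {
    uint32_t seq;
    std::vector<uint8_t> payload;
    Clock::time_point sent_at;
};

struct SendConnection {            // one row of the send connection table
    uint32_t next_seq = 0;
    std::deque<Frame> unacked;     // the "unack'd frame store"

    // Packetize, transmit, and remember the frame until it is acked.
    void send(std::vector<uint8_t> data) {
        Frame f{next_seq++, std::move(data), Clock::now()};
        transmit(f);
        unacked.push_back(std::move(f));
    }
    // Cumulative ACK: everything with seq < ack has been delivered.
    void on_ack(uint32_t ack) {
        while (!unacked.empty() && unacked.front().seq < ack)
            unacked.pop_front();
    }
    // Retransmit anything outstanding longer than the timeout.
    void on_timer(Clock::duration timeout) {
        const auto now = Clock::now();
        for (auto& f : unacked)
            if (now - f.sent_at > timeout) { transmit(f); f.sent_at = now; }
    }
    static void transmit(const Frame&) {
        // Ethernet/UDP encapsulation into the 40G MAC+PHY would go here.
    }
};
```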
LTL Enables Iron Channels
[Diagram and chart: FPGA-to-FPGA round-trip latencies over LTL, measured server to server across the datacenter network]
HaaS: Deploying Hardware Microservices
[Diagram: pools of FPGAs behind ToR and cluster switches expose hardware microservices — Bing ranking HW consumed by Bing ranking SW servers, plus HMM, LB, large-scale deep learning, and audio decode]
Benefits of HaaS
• Flexibility: many services need a large number of FPGAs; others underutilize theirs
• Many accelerators can handle the load of multiple software clients
• Consolidating underutilized FPGA accelerators into fewer shared instances increases efficiency & makes room for more accelerators
• Many datacenter services need access to multiple types of accelerators (e.g., ranking, DNN)
Programmable DNNs on FPGA
• Programmed in C++
• No Verilog skills required
• Accelerator primitives
• Matrix-vector multiply
• Vector-vector add/sub, multiply
• Element-wise ops: sigmoid, tanh, etc. (composed in the sketch below)
• 1.2 TeraOps / FPGA
• 16-bit fixed point
• 3000 MulAdds/cycle @ 200MHz
• No minibatch required
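The peak figure is straightforward arithmetic: 3,000 MulAdds/cycle x 200 MHz = 600 G MulAdds/s, i.e., 1.2 TeraOps counting each multiply and add separately. The sketch below composes the listed primitives into one dense layer, y = sigmoid(W*x + b), in plain C++; the real engine executes these as 16-bit fixed-point instructions on the FPGA at batch size 1, so this float version only illustrates the shape of the programming model.

```cpp
// Composing the slide's accelerator primitives into one dense layer.
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

using Vec = std::vector<float>;
using Mat = std::vector<Vec>;   // row-major: Mat[m] is row m

// Matrix-vector multiply primitive.
Vec matvec(const Mat& W, const Vec& x) {
    Vec y(W.size(), 0.0f);
    for (std::size_t m = 0; m < W.size(); ++m) {
        assert(W[m].size() == x.size());
        for (std::size_t n = 0; n < x.size(); ++n) y[m] += W[m][n] * x[n];
    }
    return y;
}

// Vector-vector add primitive.
Vec vadd(Vec a, const Vec& b) {
    assert(a.size() == b.size());
    for (std::size_t i = 0; i < a.size(); ++i) a[i] += b[i];
    return a;
}

// Element-wise sigmoid primitive.
Vec sigm(Vec v) {
    for (auto& e : v) e = 1.0f / (1.0f + std::exp(-e));
    return v;
}

// One dense layer, y = sigmoid(W*x + b), expressed purely in the
// engine's primitive ops; batch size 1, as the hardware favors.
Vec dense_sigmoid(const Mat& W, const Vec& b, const Vec& x) {
    return sigm(vadd(matvec(W, x), b));
}
```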
Aggressive ML: Scalable DNNs over HaaS
• Ultra low latency, high throughput evaluation
• Long term: in-situ training for model freshness
& training within compliance boundaries
• Achieve high ops/$ and ops/W vs. CPUs
[Diagram: an NN model’s layers (L0, L1) mapped across FPGAs composed over HaaS and LTL; each FPGA vector engine contains an instruction decoder & control block driving neural FUs]
Microsoft FPGA DNN goals:
• 1.2 TOPs of 16-bit fixed point → inference in hundreds of µs or a few ms
• No Verilog expertise required
• Engines can be composed to support large-scale models
MS Engine: SW-Programmable DNN Engine
FPGAs Excellent for Low-Latency DNN Inference
[Chart: GigaOp/sec vs. batch size (1-20) for a 3072x1024 matrix multiplication used in high-dimensional LSTM evaluation; the low-batch region is the operating range for online interactive services (e.g., search), while the peak performance of a low-power Nvidia M4 GPU is reached only at high batch sizes]
Low Level AI Representation (LLAIR)
Objectives:
• Serve pre-trained CNTK and TensorFlow models on FPGA, CPU, and other backends
• Export models to a framework-neutral Low Level AI Representation (LLAIR)
• Develop a federated, modular runtime for backward compatibility with existing DNN runtimes while allowing extensibility
[Pipeline diagram: a CNTK or TensorFlow model (e.g., a graph of MatMul500, Add500, Sigmoid500, Split, and Concat ops over 500x500 matrices and 1000-dim vectors) is exported to framework-neutral LLAIR; an LLAIR file transformer with subgraph compilers (CNTK, FPGA, custom) emits an optimized, partitioned LLAIR/model bundle through the packager (e.g., SearchGold and SingleBox bundles); the federated runtime deploys the bundle over HaaS-AP and executes LLAIR subgraphs on CNTK, FPGA, or custom backends. A toy IR sketch follows below.]
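A toy sketch of what a framework-neutral IR plus a placement pass might look like; the actual LLAIR format is not shown in the deck, so every name here is illustrative.

```cpp
// Toy framework-neutral graph IR and backend-placement pass
// (illustrative only): exporters would lower a CNTK/TensorFlow graph
// to typed nodes, a transformer partitions it, and a federated
// runtime dispatches each subgraph to a backend.
#include <string>
#include <vector>

enum class Backend { FPGA, CPU };

struct Node {
    std::string op;              // "MatMul500", "Add500", "Sigmoid500", ...
    std::vector<int> inputs;     // indices of producer nodes
    Backend placed_on = Backend::CPU;
};

struct Graph { std::vector<Node> nodes; };

// Placement: ops the FPGA engine implements go to the FPGA backend;
// everything else falls back to the CPU, preserving compatibility
// with existing DNN runtimes.
void partition(Graph& g) {
    auto starts_with = [](const std::string& s, const char* p) {
        return s.rfind(p, 0) == 0;
    };
    for (auto& n : g.nodes) {
        const bool fpga_ok = starts_with(n.op, "MatMul") ||
                             starts_with(n.op, "Add") ||
                             starts_with(n.op, "Sigmoid");
        n.placed_on = fpga_ok ? Backend::FPGA : Backend::CPU;
    }
}
```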
Configurable Clouds will Change the World
• Ability to reprogram a datacenter’s hardware protocols
• Networking, storage, security
• Can turn homogeneous machines into specialized SKUs dynamically
• Unprecedented performance and low latency at hyperscale
• Exa-ops of performance with a 10 microsecond diameter
• What would you do with the world’s most powerful fabric?