AI Accelerators for Cloud Datacenters
Prof. Joo-Young Kim
7/10/2020 @ 산업교육연구소 (Industrial Education Research Institute)
Agenda
2
• Introduction
• Cloud Infrastructure
- Datacenter challenges
- Microsoft Catapult
• AI Accelerators for Datacenters
- Google TPU
- HabanaLabs Gaudi
- Graphcore IPU
- Baidu Kunlun
- Cerebras Wafer-Scale Engine
- Intel NNP-T Processor
• Summary
Cloud Services
3
• End of Moore’s Law
• 200+ cloud services
• Capabilities and operating-cost savings ∝ performance/watt per $
Energy Efficiency Trade-Off
4
Source: Bob Brodersen, Berkeley Wireless group
Datacenter Challenges
• Workload diversity
- Software services change monthly
- The number of applications keeps increasing
• Maintenance
- Little HW maintenance, no physical accessibility
- Machines last ~3 years and can be repurposed during their lifetime
- Homogeneity is critical to reduce cost
• Specialization
- Slowing of Moore’s-law performance scaling
- Compute requirements grow beyond what conventional CPU-only systems can deliver
5
*Cycles in 50 hottest binaries (%)
*S. Kanev, “Profiling a Warehouse-Scale Computer,” ISCA 2015
FPGA vs ASIC
6
[Figure: three server configurations. (a) Xeon CPU + NIC + search accelerator on an FPGA, which can be reprogrammed to Search Acc. v2 when the algorithm changes; (b) the same accelerator as a fixed ASIC: wasted power once the software moves on, and it holds back SW evolution; (c) Xeon CPU + NIC + a generic math accelerator: wasted power, and one more thing that can break.]
Catapult Gen1 (2014)
• Altera Stratix V D5
• 172,600 ALMs, 2,014 M20Ks, 1,590 DSPs
• PCIe Gen 3 x8
• 8GB DDR3-1333
• Powered by PCIe slot
• 6x8 Torus Network
7
[Board photo: Stratix V FPGA, 8GB DDR3, PCIe Gen3 x8, 4 x 20Gbps transceivers]
“Reconfigurable Fabric for Accelerating Large-Scale Datacenter Services”, ISCA 2014
Open Compute Server
8
• Two 8-core Xeon 2.1 GHz CPUs
• 64 GB DRAM
• 4 HDDs @ 2 TB, 2 SSDs @ 512 GB
• 10 Gb Ethernet
• Plug-in FPGA via mezzanine connector
[Photo annotations: 68 ⁰C, mezzanine connector for the FPGA]
Rack Design
9
• High density
- 1U (height: 1.75 inch), half-width servers
- Homogeneous design
- 1 FPGA per server (not enough room/power for a GPU)
- Half rack: 2 x 24 servers
[Diagram: servers connected to a Top-of-Rack switch (TOR)]
• Local torus network
- Dedicated 6x8 torus enables multi-FPGA accelerators
- Requires additional cabling mapping the physical 2x24 layout to the logical 6x8 torus
Shell and Role
• Shell
- Operating system for FPGA
- Handles all I/O & management tasks
- Exposes simple FIFOs
• Role
- Only application logic
- Partial reconfiguration boundary
• Debug support
- Flight data recorder
- JTAG cable
10
[Shell block diagram: x8 PCIe core with DMA engine, config flash (RSU) backed by 256 Mb QSPI flash, two DDR3 cores driving 4 GB DDR3-1333 ECC SO-DIMMs (72-bit), North/South/East/West SLIII links into the inter-FPGA router, plus JTAG, LEDs, I2C, temperature sensors, transceiver reconfig, and SEU monitoring; the Role hosts the application (Bing, Azure, DNN, etc.) inside the Shell, with the host CPU attached over PCIe.]
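To make the shell/role split concrete, here is a minimal, hypothetical Python sketch (not the actual Catapult interfaces): the shell owns all I/O and exposes only simple FIFOs, while the role is pure application logic behind that boundary.

```python
from queue import Queue

class Shell:
    """Owns all I/O (PCIe DMA, DDR, network) and exposes only simple FIFOs to the role."""
    def __init__(self):
        self.to_role = Queue()    # requests arriving from the host / network
        self.from_role = Queue()  # results going back out

    def push_request(self, data):
        self.to_role.put(data)

    def pop_result(self):
        return self.from_role.get()

def role_ranker(shell):
    """Role: application logic only; it never touches I/O or board management directly."""
    while not shell.to_role.empty():
        doc_features = shell.to_role.get()
        score = sum(doc_features)            # stand-in for the real scoring logic
        shell.from_role.put(score)

shell = Shell()
shell.push_request([0.2, 0.5, 0.1])
role_ranker(shell)
print(shell.pop_result())                    # ~0.8
```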
Catapult Gen2 (2016)
11
• From Torus to Ethernet
- Bump-in-the-wire: the FPGA sits between the 40G NIC and the ToR switch
[Board: FPGA with 4GB DDR, 2 x PCIe Gen3 x8, 35W power budget]
“A Cloud-Scale Acceleration Architecture,” Micro 2016
Integration to DC Infrastructure
12
An FPGA can communicate with any other FPGA in the datacenter
Network Coverage and Latencies
13
[Plot: round-trip latency (us, 0-25) vs. number of reachable hosts/FPGAs (1 to 1,000,000, log scale). Curves show LTL average latency and LTL 99.9th-percentile latency for L0 (same TOR), L1, and L2 scopes, with example L0/L1/L2 latency histograms for different pairs of FPGAs; the Catapult Gen1 6x8 torus can reach at most 48 FPGAs, while LTL reaches on the order of 10K-250K hosts.]
Configurable Cloud
14
[Figure: the configurable cloud. Racks of servers behind TOR switches, connected through L1/L2 network tiers, with FPGA pools allocated to services such as deep neural networks, web search ranking, SQL, and storage.]
Bing Ranking Acceleration
15
[Plot: normalized load and latency over five days (y-axis 1.0-7.0), comparing 99.9% SW latency, 99.9% FPGA latency, average FPGA query load, and average SW load.]
• Lower latency than software even with 2x query load
• More consistent 99.9th tail latency
AI Chip Market
16
[Chart: required performance for AI vs. time (2000-2030). The deep learning revolution (AlexNet, 2012) outpaces existing chips and GPUs; dedicated AI chips target >1000x energy efficiency.]
$65B AI semiconductor market in 2025, 19% of the semiconductor market, growing 18-19% per year (McKinsey report, 2019):
- AI semiconductor total market ($ billion): 2017: 240 (AI 17, non-AI 223); 2020E: 256 (AI 32, non-AI 224); 2025E: 370 (AI 65, non-AI 295)
- AI share of total market (%): 2017: 7%; 2020E: 11%; 2025E: 19%
- CAGR 2017-25 (%): AI 18-19 vs. non-AI 3-4, i.e. ~5x growth for AI semiconductors
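As a quick sanity check, the compound annual growth rates implied by the 2017 and 2025E values in the chart match the quoted figures:

```python
def cagr(start, end, years):
    """Compound annual growth rate."""
    return (end / start) ** (1 / years) - 1

# Values ($ billion) taken from the McKinsey chart above
print(f"AI:     {cagr(17, 65, 8):.1%}")    # ~18.2%, matching the quoted 18-19%
print(f"Non-AI: {cagr(223, 295, 8):.1%}")  # ~3.6%,  matching the quoted 3-4%
```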
AI Chip Industry
17
• Google: TPU deployment (2016)
• Facebook: Open Compute initiative (2011); AI chip dev team (2019)
• Microsoft: Catapult FPGA deployment (2014); Brainwave deployment (2018)
• Baidu: Kunlun in production (2020)
• Tesla: Full Self-Driving (FSD) chip for autonomous vehicles (2019)
• HabanaLabs: AI training processor Gaudi (2019)
And more (Graphcore, Cerebras, Intel, Groq, Wave Computing, ...)
Google TPU
• Simple architecture supporting MLP, CNN, and RNN models, enabling fast development
- Host interface
- Unified buffer (24 MB), weight FIFO
- Matrix multiply unit (256 x 256 MACs)
- Accumulators (4 MB: 4096 x 256-element x 32-bit)
- 8-bit integer multiplication
• Systolic array architecture
- Systolic execution saves energy by reducing reads
and writes of the Unified Buffer
- Activation data flows in from the left and weights
are pre-loaded from the top
- A given 256-element multiply-accumulate operation
moves through the matrix as a diagonal wavefront
- Throughput-oriented: control and data are pipelined
18
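The wavefront behavior is easiest to see in a small software model. The following is a toy weight-stationary systolic simulation (the real TPU pipelines 256-wide operations in 8-bit integer; this sketch only shows how preloaded weights, left-to-right activations, and downward-flowing partial sums produce y = a @ W):

```python
import numpy as np

def systolic_matvec(W, a):
    """Toy weight-stationary systolic array computing y = a @ W.
    Weights are pre-loaded into the cells; activation a[i] enters row i from the
    left at cycle i; partial sums flow downward and drain at the bottom edge."""
    n = W.shape[0]
    act  = np.zeros((n, n))   # activation register in each cell (moves east each cycle)
    psum = np.zeros((n, n))   # partial-sum register in each cell (moves south each cycle)
    y = np.zeros(n)
    for t in range(2 * n - 1):
        new_act, new_psum = np.zeros_like(act), np.zeros_like(psum)
        for i in range(n):
            for j in range(n):
                a_in = act[i, j - 1] if j > 0 else (a[i] if t == i else 0.0)
                p_in = psum[i - 1, j] if i > 0 else 0.0
                new_act[i, j]  = a_in                      # pass the activation eastward
                new_psum[i, j] = p_in + a_in * W[i, j]     # accumulate and pass southward
        act, psum = new_act, new_psum
        if t >= n - 1:                                     # the diagonal wavefront reaches
            y[t - n + 1] = psum[n - 1, t - n + 1]          # column (t-n+1) of the bottom edge
    return y

W = np.arange(9, dtype=float).reshape(3, 3)
a = np.array([1.0, 2.0, 3.0])
print(systolic_matvec(W, a), a @ W)   # both print [24. 30. 36.]
```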
TPU v1 Performance
• CPU vs GPU vs TPU
19
[Roofline plot: Teraops/sec vs. operational intensity (MAC ops per weight byte) for Google TPU, nVidia K80, and Intel Haswell. GM: geometric mean, WM: weighted mean.]
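The comparison follows the roofline model: attainable throughput is the lesser of peak compute and operational intensity times memory bandwidth. A minimal sketch, using roughly the TPU v1 figures reported in the TPU paper (about 92 TOPS 8-bit peak and 34 GB/s weight-memory bandwidth), purely for illustration:

```python
def roofline(peak_ops, mem_bw_bytes, intensity):
    """Attainable throughput (ops/s) = min(peak compute, intensity * memory bandwidth)."""
    return min(peak_ops, intensity * mem_bw_bytes)

# TPU v1 (approximate published figures): 92 TOPS 8-bit peak, 34 GB/s memory bandwidth
for oi in (10, 100, 1000, 10000):       # operational intensity: MAC ops per weight byte
    print(f"{oi:>6} ops/byte -> {roofline(92e12, 34e9, oi) / 1e12:5.1f} Teraops/s")
# Low-intensity models (MLPs, LSTMs) sit on the bandwidth slope;
# high-intensity CNNs approach the 92 Teraops/s compute roof.
```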
TPU v2 & TPU v3
• 128 x 128 systolic array (22.5 TFLOPS per core)
• float32 accumulate / bfloat16 multiplies
• 2 cores + 2 HBMs per chip / 4 chips per board
20
[Images: TPU v2 and TPU v3]
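bfloat16 keeps float32's sign and 8-bit exponent but only 7 mantissa bits, so a float32 value can be converted by simply truncating its low 16 bits. A small numpy illustration:

```python
import numpy as np

def to_bfloat16(x):
    """Truncate a float32 value to bfloat16 (sign + 8-bit exponent + 7-bit mantissa)."""
    bits = np.array(x, dtype=np.float32).view(np.uint32)
    return (bits & np.uint32(0xFFFF0000)).view(np.float32)  # same exponent range, fewer mantissa bits

print(to_bfloat16(3.14159265))   # 3.140625: small precision loss in the mantissa
print(to_bfloat16(1.0e38))       # ~1e38: still representable, same dynamic range as float32
```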
Cloud TPU v2 Pod
• Single board: 180TFLOPS + 64GB HBM
• Single pod (64 boards):11.5 PFLOPS + 4TB HBM
• 2D torus topology
21
[Photo: a single TPU v2 board (4 chips)]
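The pod-level figures follow directly from the board-level ones:

```python
boards = 64
tflops_per_board, hbm_gb_per_board = 180, 64
print(boards * tflops_per_board / 1000, "PFLOPS")  # 11.52 PFLOPS (quoted as 11.5)
print(boards * hbm_gb_per_board / 1024, "TB HBM")  # 4.0 TB
```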
Cloud TPU v3 Pod
• > 100 PFLOPS
• 32TB HBM
22
Habana Labs Gaudi
• Built for AI training performance at scale
- High throughput at low batch size
- High power efficiency
• Enable native Ethernet scale-out
- On-chip RDMA over Converged Ethernet (RoCE v2)
- Reduced system complexity, cost and power
- Leverage widely used standard Ethernet switches
• Promote standard form factors
- Open Compute Project Accelerator Module (OAM)
• SW infrastructure and tools
- Frameworks and ML compilers support
- TPC kernel library and user-friendly dev tools to enable
optimization/customization
23
Gaudi Processor Architecture
24
• 500mm2 die @ TSMC 16nm
• TPC 2.0 (Tensor Processing Core)
- Support DL training & inference
- VLIW SIMD (C-programmable)
• GEMM operation engine
- Highly configurable
• PCIe Gen4.0 x16
• 4 HBMs
- 2GT/s, 1TB/s BW, 32GB capacity
• RoCE v2
- 10 ports of 100Gb or 20 ports of 50Gb
• Mixed-precision data types
- FP32, BF16, INT32/16/8, UINT32/16/8
Heterogeneous compute architecture
256GB/s, 8GB
per HBM
Gaudi Software Platform
25
Automatic floating-point-to-fixed-point quantization with near-zero accuracy loss
[Diagram: Gaudi software platform, with the user's custom model compiled on the host side and executed on the device side]
Training System with Gaudi
26
Various network configurations & systems possible for scale-out training
Habana Labs Systems-1 (HLS-1)
High-performance system with 16 Gaudi cards
Data & Model Parallelism using Gaudi
27
• Topology for data parallelism: 3 reduction levels; 8x11x12 = 1056 Gaudi cards (gradients are combined by all-reduce; see the sketch below)
• Topology for model parallelism: 8x8 = 64 Gaudi cards; model parallelism requires more bandwidth
• Large-scale systems are built with all-to-all connectivity using a single networking hop, thanks to Ethernet integration
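Data-parallel training over the RoCE links ultimately reduces to an all-reduce of gradients across cards. A minimal, generic ring all-reduce sketch (reduce-scatter followed by all-gather; this is the standard algorithm, not Habana's collective library):

```python
import numpy as np

def ring_allreduce(grads):
    """Ring all-reduce: reduce-scatter followed by all-gather.
    grads: list of equal-length 1-D gradient arrays, one per worker."""
    n = len(grads)
    chunks = [np.array_split(g.astype(float), n) for g in grads]  # n chunks per worker
    # Reduce-scatter: at step s, worker i sends chunk (i - s) % n to worker i+1
    for s in range(n - 1):
        sends = [(i, (i - s) % n, chunks[i][(i - s) % n].copy()) for i in range(n)]
        for src, idx, data in sends:
            chunks[(src + 1) % n][idx] += data
    # Now worker i holds the fully reduced chunk (i + 1) % n.
    # All-gather: circulate the reduced chunks so every worker ends with the full sum
    for s in range(n - 1):
        sends = [(i, (i + 1 - s) % n, chunks[i][(i + 1 - s) % n].copy()) for i in range(n)]
        for src, idx, data in sends:
            chunks[(src + 1) % n][idx] = data
    return [np.concatenate(c) for c in chunks]

workers = [np.arange(8.0) * (w + 1) for w in range(4)]    # 4 workers' gradient vectors
out = ring_allreduce(workers)
assert all(np.allclose(o, sum(workers)) for o in out)
print(out[0])   # [ 0. 10. 20. ... 70.]  (= arange(8) * (1+2+3+4))
```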
Gaudi Training Performance
28
• ResNet-50 Training Throughput
[Plots: throughput (images-per-second, thousands) vs. number of processors for Habana Gaudi vs. V100, and throughput vs. number of Gaudi chips used.]
A single Gaudi dissipates 140 Watt and processes 1,650 images/second
29
Gaudi Mezzanine Card (HL-205) & System (HLS-1)
Processor: Gaudi HL-2000
- Host interface: PCIe Gen 4.0 x16
- Memory: 32GB HBM2
- Memory BW: 1 TB/s
- ECC protected: yes
- Max power consumption: 300W
- Interconnect: 2 Tbps (20 x 56Gbps PAM4 Tx/Rx SerDes; RoCE RDMA, 10 x 100GbE or 20 x 50GbE/25GbE)
System: HLS-1, 8x Gaudi (HL-205)
- Host interface: 4 ports of x16 PCIe Gen 4.0
- Memory: 256GB HBM2
- Memory BW: 8 TB/s
- ECC protected: yes
- Max power consumption: 3 kW
- Interconnect: 24 x 100Gbps RoCE v2 RDMA Ethernet ports (6 x QSFP-DD)
Graphcore IPU Processor
30
• MIMD architecture for fine-grained parallelism
• 23.6B transistors, 800mm2 die @ 16nm
• 124.5 TFlops @ 120W (FP16 multiply + FP32 accumulate)
• 1216 tiles (each tile = core + scratchpad)
- 7296 hardware threads in total (6 per tile)
- 304 MB total memory (256 KB per tile)
- 45 TB/s memory BW & 6-cycle latency
- No shared memory
• PCIe Gen4 x16
- 64 GB/s bidirectional BW to host
• IPU-Exchange
- 8 TB/s all-to-all exchange among IPU tiles
- Non-blocking, any communication pattern
• IPU-Links
- 80 IPU-Links
- 320 GB/s chip-to-chip BW
[Die diagram: tiles, IPU-Links, PCIe, IPU-Exchange]
IPU Tile
31
• Tile = computing core + 256 KB scratch pad
• Specialized pipelines called Accumulating
Matrix Product (AMP)
• AMP unit can accelerate matrix multiplication
and convolution operation
• The IPU tiles can be used for MIMD parallelism
[Diagram: tile execution phases: codelet compute, exchange, and waiting (see the bulk-synchronous sketch below)]
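The phases in the diagram suggest a bulk-synchronous style: every tile computes on its own scratchpad, all tiles synchronize, then data is exchanged. A toy Python sketch of one such step (illustrative only, not Poplar code):

```python
# Toy bulk-synchronous step across four "tiles": compute, then (barrier), then exchange.
# Between syncs each tile touches only its own scratchpad, as on the IPU.
scratchpads = {tile: {"x": float(tile)} for tile in range(4)}
outboxes = {}

# Compute phase: every tile runs its codelet on purely local data
for tile, mem in scratchpads.items():
    outboxes[tile] = mem["x"] * 2.0              # result destined for a neighbouring tile

# (implicit barrier: all tiles wait here until every codelet has finished)

# Exchange phase: the all-to-all exchange delivers each result to its destination tile
for tile, mem in scratchpads.items():
    mem["y"] = outboxes[(tile + 1) % 4]          # receive from the next tile over

print(scratchpads)   # each tile now also holds its neighbour's doubled value
```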
Building Multi-IPU Systems
32
[Figure: scaling from a single IPU processor, to an IPU PCIe card (2 chips), to an IPU server (8 cards); 80 IPU-Links provide chip-to-chip connectivity]
POPLAR Software Development Kit
33
[Diagram: standard ML frameworks feed the high-level compiler and the POPLAR graph toolchain for the IPU, targeting IPU servers & systems]
Benchmark Performance
34
• BERT (NLP) Training: 25% faster
• Dense Autoencoder Training: 2.3x higher
• MCMC Probabilistic Model Training: 15.2x faster
• Reinforcement Learning Policy Training: ~13x higher
Benchmark Performance
35
• BERT Inference: 2x higher throughput at similar latency vs. Nvidia V100
• ResNeXt-101 Inference: 6x higher throughput at 22x lower latency; 3.7x higher throughput at 10x lower latency
Baidu Kunlun
36
• Cloud-to-edge AI chip
• Evolved from Baidu's programmable FPGA accelerators (>30x faster than the previous FPGA design)
• Samsung 14nm Technology
• XPU core
• Pre-trained NLP model (Ernie)
• I-Cube 2.5D packaging
• In-Processor-Memory
- 16MB SRAM/unit
• 2 HBMs (512GB/s)
• PCIe Gen 4.0 x8 (32GB/s)
• 260TOPS@150W
XPU Core Architecture
37
• Many tiny cores
- Instruction set based software-programmable
- Domain specific ISA
- No operating system & no cache
- Flexible to serve diverse workloads
• Customized logic
- Hardware-reconfigurable
- Achieve high performance efficiency
- SDA-II accelerator can be used for DL
• Resource allocation is reconfigurable
- Set the ratio of cores vs. custom logic
depending on application’s requirement
XPU: Architecture of Tiny Cores
38
• 32 cores are clustered and share
- 32KB multi-bank memory
- SFA (special function accelerator)
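To illustrate how a shared multi-bank memory serves many small cores: accesses that land in different banks complete together, while accesses to the same bank serialize. The bank count and word size below are assumptions for illustration only (the slide states only a 32KB multi-bank memory):

```python
from collections import Counter

NUM_BANKS = 8        # assumed for illustration; the slide only says "multi-bank"
WORD_BYTES = 4       # assumed word size

def bank_of(addr):
    """Low-order interleaving: consecutive words map to consecutive banks."""
    return (addr // WORD_BYTES) % NUM_BANKS

def cycles_for(accesses):
    """Cycles to service one access per core: the most-loaded bank is the bottleneck."""
    return max(Counter(bank_of(a) for a in accesses).values())

print(cycles_for([4 * i for i in range(32)]))    # 4: 32 cores spread evenly over 8 banks
print(cycles_for([32 * i for i in range(32)]))   # 32: a strided pattern hits a single bank
```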
XPU: Architecture of Tiny Cores
39
• MIPS-like instruction set
• Private scratchpad memory
- 16 KB or 32 KB
• 4-stage pipeline
- Designed for low latency
- Branch history table (BHT)
Benchmark Performance
40
• XPU configuration: 256 tiny cores + SDA-II @ 600MHz
• BERT Inference (QPS, Kunlun Int16 vs. Nvidia T4 FP16): 3x higher throughput than Nvidia T4
• GEMM-Int8 (TOPS; Kunlun vs. Nvidia T4, CPU, P4): 1.7x faster than Nvidia T4
• Yolo v3 (QPS, Kunlun Int16 vs. Nvidia T4 FP16): 1.2x faster than Nvidia T4
Cerebras Wafer Scale Engine (WSE)
41
• TSMC 16nm technology
• 1.2T transistors on a 46,225 mm2 silicon wafer
• 400,000 AI-optimized cores
• 18 GB on-chip memory (SRAM)
- 9.6 PB/s memory BW
• Memory architecture optimized for DL
- Memory uniformly distributed across cores
• High-bandwidth, low-latency interconnect
- 2D mesh topology
- Hardware-based communication
- 100 Pbit/s fabric bandwidth
• 1 GHz clock speed & 15kW power consumption
• Largest chip ever built
Cerebras Wafer Scale Engine Core
42
Sparse Linear Algebra (SLA) Core
• Fully programmable compute core
• Full array of general instructions with ML extensions
• Flexible general operations for control processing
- E.g. arithmetic, logical, load/store, branch
• Optimized for tensor operations
- Tensors as first-class operands
• Sparsity harvesting technology
- SLA cores intelligently skip the zeros
- All zeros are filtered out
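Conceptually, sparsity harvesting means a multiply-accumulate is only issued when the operand is non-zero. A toy sketch comparing issued MACs against the dense count (illustrative only):

```python
import numpy as np

def sparse_dot(acts, weights):
    """Dot product that skips zero activations, counting the MACs actually issued."""
    total, macs = 0.0, 0
    for a, w in zip(acts, weights):
        if a != 0.0:              # zeros are filtered out: no multiply-accumulate is issued
            total += a * w
            macs += 1
    return total, macs

rng = np.random.default_rng(0)
acts = rng.random(1000) * (rng.random(1000) > 0.7)   # ~70% zeros, e.g. post-ReLU activations
weights = rng.random(1000)
val, macs = sparse_dot(acts, weights)
print(macs, "of 1000 MACs issued; matches dense result:", bool(np.isclose(val, acts @ weights)))
```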
Programming the Wafer Scale Engine
43
• Neural network (NN) models are expressed in common ML frameworks
• The Cerebras framework interface extracts the NN
• Placement and routing map NN layers onto the fabric
• The entire wafer operates on a single neural network
44
Challenges of WSE
• Cross die connectivity
- Add cross-die wires across scribe lines of wafer in partnership with TSMC
• Yield
- Have redundant cores and reconnect fabric
• Wafer-wide package assembly technology
• Power and cooling
Intel NNP-T
45
• Intel Nervana Neural Network Processor
for Training (NNP-T)
• Train a network as fast as possible within
a given power budget, targeting larger
models and datasets
• Balance between compute, communication,
and memory for system performance
• Reuse on-die data as much as possible
• Optimize for batched workloads
• Built-in scale-out support
Intel NNP-T Architecture
46
• 27B transistors, 680mm2 die @
TSMC 16nm (2.5D packaging)
• 24 Tensor Processor Clusters
- Up to 119 TOPS
• 60MB on-chip distributed
memory
• 4 x HBM2
- 1.22 TB/s BW, 8GB capacity
• PCIe Gen 4.0 x16
• Up to 1.1GHz core frequency
• 64 lanes SerDes
- Inter-chip communication
Tensor Processing Cluster (TPC)
47
Compute Core
48
• Bfloat16 matrix multiply core (32x32)
• FP32 & BF16 support for all other
operations
• 2x multiply cores per TPC to amortize other
SoC resources (control, memory, network)
• Vector operations for non-GEMM
- Compound pipeline
- DL specific optimizations
• Activation functions, reductions, random-
number generation & accumulations
• Programmable FP32 look-up tables
NNP-T On-Die Communication
49
• Bidirectional 2-D mesh architecture allows any-to-any communication
• Cut-through forwarding and multicast support
• 2.6 TB/s total cross-sectional BW
• HBM & SerDes are shared through the mesh
• Support for direct peer-to-peer communication between TPCs
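A dimension-ordered (XY) routing sketch on a 2-D mesh shows how any cluster can reach any other with a bounded number of hops; the grid size and routing policy here are illustrative, not Intel's exact scheme:

```python
def xy_route(src, dst):
    """Dimension-ordered routing on a 2-D mesh: travel along X first, then along Y.
    src/dst are (x, y) grid coordinates; returns the list of routers visited."""
    (x, y), (dx, dy) = src, dst
    path = [(x, y)]
    while x != dx:
        x += 1 if dx > x else -1
        path.append((x, y))
    while y != dy:
        y += 1 if dy > y else -1
        path.append((x, y))
    return path

# Illustrative 6x4 grid: route from cluster (0, 0) to cluster (5, 3)
hops = xy_route((0, 0), (5, 3))
print(len(hops) - 1, "hops:", hops)   # 8 hops; cut-through forwarding keeps per-hop latency low
```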
NNP-T Software Stack
50
• Full software stack built with open components
• Direct integration with DL frameworks
• nGraph
- Hardware-agnostic DL library & compiler
- Provides a common set of optimizations for NNP-T
• Argon
- NNP-T DNN compute & communication kernel library
• Low-level programmability
- NNP-T kernel development toolchain with tensor compiler
[Stack diagram: Argon DNN kernel library, kernel mode driver, board firmware, chip firmware]
Benchmark Performance
51
Convolution operation (utilization):
- c64xh56xw56_k64xr3xs3_st1_n128: 86%
- c128xh28xw28_k128xr3xs3_st1_n128: 71%
- c512xh28xw28_k128xr1xs1_st1_n128: 65%
- c128xh28xw28_k512xr1xs1_st1_n128: 59%
- c256xh14xw14_k1024xr1xs1_st1_n128: 62%
- c256xh28xw28_k512xr1xs1_st2_n128: 71%
- c32xh120xw120_k64xr5xs5_st1_n128: 87%
(C = # input channels, H = height, W = width, K = # filters, R = filter X, S = filter Y, ST = stride, N = minibatch size)
GEMM operation (utilization):
- 1024 x 700 x 512: 31.1%
- 1760 x 7133 x 1760: 44.5%
- 2048 x 7133 x 2048: 46.7%
- 2560 x 7133 x 2560: 57.1%
- 4096 x 7133 x 4096: 57.4%
- 5124 x 9124 x 2048: 55.5%
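To read the convolution descriptors: c64xh56xw56_k64xr3xs3_st1_n128, for example, is a 3x3 convolution over a 64-channel 56x56 input with 64 filters, stride 1, and batch size 128. Its MAC count, assuming 'same' padding so the output stays at H x W (an assumption; the slide does not state the padding), is:

```python
def conv_macs(c, h, w, k, r, s, stride, n):
    """MACs for one convolution layer, assuming 'same' padding (output = H/stride x W/stride)."""
    out_h, out_w = h // stride, w // stride
    return n * k * out_h * out_w * c * r * s

macs = conv_macs(c=64, h=56, w=56, k=64, r=3, s=3, stride=1, n=128)      # first table row
print(f"{macs / 1e9:.1f} GMACs = {2 * macs / 1e9:.1f} GFLOPs per pass")  # 14.8 GMACs
```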
Summary
52
• Cloud AI accelerators’ goals
- Better cost-performance than GPUs, scalability, programmability
• Compute
- Specialized cores for tensor processing such as matrix, convolution
• Memory
- HBM
- Distributed on-chip memory & scratchpads
- No hardware caches
• Communications
- High bandwidth on-chip networks
- Custom inter-chip links
- PCIe Gen 4.0 to host
• Software
- Compatibility to existing frameworks (ONNX, TensorFlow, PyTorch)
- Graph compiler + device-oriented optimization
References
53
- https://www.hotchips.org/hc31/HC31_1.14_HabanaLabs.Eitan_Medina.v9.pdf
- https://habana.ai/wp-content/uploads/2019/06/Habana-Gaudi-Training-Platform-whitepaper.pdf
- https://www.graphcore.ai/products/ipu
- https://www.servethehome.com/hands-on-with-a-graphcore-c2-ipu-pcie-card-at-dell-tech-world/
- Z. Jia, “Dissecting the Graphcore IPU Architecture via Microbenchmarking,” Citadel technical report, 2019
- V. Rege, “Graphcore, the Need for New Hardware for Artificial Intelligence,” AI Hardware Summit 2019
- https://www.graphcore.ai/posts/new-graphcore-ipu-benchmarks
- https://m.itbiznews.com/news/newsview.php?ncode=1065569594387854
- https://www.hotchips.org/wp-content/uploads/hc_archives/hc29/HC29.21-Monday-Pub/HC29.21.40-Processors-Pub/HC29.21.410-XPU-FPGA-Ouyang-Baidu.pdf
- https://www.firstxw.com/view/254356.html
- https://www.hotchips.org/hc31/HC31_1.13_Cerebras.SeanLie.v02.pdf
- https://www.cerebras.net/cerebras-wafer-scale-engine-why-we-need-big-chips-for-deep-learning/
- https://www.hotchips.org/hc31/HC31_1.12_Intel_Intel.AndrewYang.v0.92.pdf
- https://www.businesswire.com/news/home/20191112005277/en/Intel-Speeds-AI-Development-Deployment-Performance-New
