CPU / GPU / TPU architecture
Dov Nimratz
Senior Solution Architect
Table of contents
1. Historical context
2. CPU architecture
3. GPU architecture - Genie in a bottle for artificial intelligence
4. TPU architecture - AI dedicated processing chip
5. Next technological step

Historical context
Harvard architecture
● Originated with the Harvard Mark I, designed by Howard Aiken in the 1930s
● Separate storage and signal pathways for instructions and data
● Frequently used in DSPs

von Neumann architecture
SISD (Single Instruction, Single Data) architecture
1. The principle of duality (binary coding).
2. The principle of program control.
3. The principle of memory homogeneity: instructions and data share one memory.
4. The principle of memory addressability.
5. The principle of sequential program execution.
6. The principle of conditional jumps.
Storing instructions as ordinary data is what makes "programs that write programs" possible.
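That last point can be made concrete with a short sketch (Python here, an assumption of this write-up; the deck itself shows no code): because instructions and data live in the same memory, a program can build new code as data and then execute it.

```python
# Hypothetical illustration: code is held in memory as ordinary data,
# so a program can write and then run another program.
source = "def square(x):\n    return x * x\n"   # code stored as data
namespace = {}
exec(source, namespace)                          # turn the data into code
print(namespace["square"](7))                    # -> 49
```
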
Flynn's data / instruction classification model
● Single instruction stream, single data stream (SISD) - traditional uniprocessor machines, like older personal computers.
● Single instruction stream, multiple data streams (SIMD) - the most common style of parallel programming.
● Multiple instruction streams, single data stream (MISD) - an uncommon architecture, generally used for fault tolerance.
● Multiple instruction streams, multiple data streams (MIMD) - distributed and multi-core systems.
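A rough sketch of the SISD/SIMD distinction, assuming NumPy is available (illustrative, not from the original slides): the scalar loop touches one data element per step, while the vectorized call applies one instruction across the whole array, which NumPy dispatches to SIMD-style machine code internally.

```python
import numpy as np

data = np.arange(10_000, dtype=np.float32)

# SISD style: a single instruction stream handles one data element at a time.
def scale_sisd(xs, factor):
    out = np.empty_like(xs)
    for i in range(len(xs)):          # one element per step
        out[i] = xs[i] * factor
    return out

# SIMD style: one instruction applied to many data elements at once.
def scale_simd(xs, factor):
    return xs * factor

assert np.allclose(scale_sisd(data, 2.0), scale_simd(data, 2.0))
```
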
CPU architecture
CISC / RISC / MISC / VLIW CPUs

CISC                                          | RISC
----------------------------------------------|--------------------------------------------
Emphasis on hardware                          | Emphasis on software
Multiple instruction sizes and formats        | Uniform-size instructions with few formats
Fewer registers                               | More registers
Many addressing modes                         | Few addressing modes
Extensive use of microprogramming             | Complexity pushed into the compiler
Instructions take a variable number of cycles | Instructions take one cycle
Pipelining is difficult                       | Pipelining is easy

CPU architecture
● The main task of the CPU is to execute a chain of instructions in the shortest possible time.
● The CPU may execute several chains at the same time, executing them separately and then merging them back into one in the correct order.
● Each instruction in the stream may depend on the instructions preceding it.

ALU architecture
GPU architecture - Genie in a bottle for artificial intelligence
Task vs data parallelism
Task parallel:
– Independent processes with little communication
– Easy to use
Data parallel:
– Lots of data on which the same computation is executed
– No dependencies between data elements in each computation step
– Can saturate many ALUs
– But often requires redesign of traditional algorithms
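A minimal sketch of the two styles in Python (job names and workloads are hypothetical): the independent jobs run as separate processes with no communication, while the data-parallel step applies one computation to every element with no inter-element dependencies.

```python
import numpy as np
from concurrent.futures import ProcessPoolExecutor

# Task parallelism: independent jobs, little communication between them.
def run_job(job_id):                      # hypothetical stand-alone task
    return f"job-{job_id} finished"

# Data parallelism: the same computation over every data element,
# with no dependencies between elements in a computation step.
def brighten(image):
    return np.clip(image * 1.2, 0.0, 1.0)

if __name__ == "__main__":
    with ProcessPoolExecutor() as pool:            # task parallel
        print(list(pool.map(run_job, range(4))))
    frame = np.random.rand(480, 640).astype(np.float32)
    print(brighten(frame).mean())                  # data parallel
```
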
CPU vs GPU (GPGPU)
CPU
– Really fast caches (great for data reuse)
– Fine branching granularity
– Lots of different processes/threads
– High performance on a single thread of execution
GPU
– Lots of math units
– Fast access to onboard memory
– Runs a program on each fragment/vertex
– High throughput on parallel tasks
● CPUs are great for task parallelism
● GPUs are great for data parallelism

Ideal apps to target GPGPU
– Large data sets
– High parallelism
– Minimal dependencies between data elements
– High arithmetic intensity
– Lots of work to do without CPU intervention

Graphics pipeline in a GPU
Kernels
– Functions applied to each element in a stream
  • transforms, PDEs, ...
– No dependencies between stream elements
  • Encourages high arithmetic intensity
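A stream kernel in this sense is just a pure, element-independent function. A small NumPy sketch (illustrative; the function itself is hypothetical):

```python
import numpy as np

# A stream "kernel": a pure function applied independently to every
# element, so all elements could be processed in parallel.
def kernel(x):
    # several arithmetic operations per element -> high arithmetic intensity
    return np.sin(x) * np.cos(x) + 0.5 * x * x

stream = np.linspace(0.0, 1.0, 10_000, dtype=np.float32)
result = kernel(stream)      # no dependencies between stream elements
```
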
GPGPU block diagram
● SIMD (single instruction, multiple data)
● 8-16 stream cores in each processor
● PE (processing element) / ALU

Patterns
Deep Learning Neural Networks
Three kinds of NNs are popular today:
1. Multi-Layer Perceptrons (MLP): each new layer is a set of nonlinear functions of weighted sums of all outputs (fully connected) from the prior layer.
2. Convolutional Neural Networks (CNN): each ensuing layer is a set of nonlinear functions of weighted sums of spatially nearby subsets of outputs from the prior layer, which reuses the weights.
3. Recurrent Neural Networks (RNN): each subsequent layer is a collection of nonlinear functions of weighted sums of outputs and the previous state. The most popular RNN is Long Short-Term Memory (LSTM). The art of the LSTM is in deciding what to forget and what to pass on as state to the next layer. The weights are reused across time steps.
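A minimal sketch of one MLP layer as described above (NumPy; the sizes and names are illustrative, not from the deck): a weighted sum over all prior-layer outputs, followed by a nonlinearity.

```python
import numpy as np

rng = np.random.default_rng(0)

def mlp_layer(prev_outputs, weights, bias):
    # weighted sums of all outputs from the prior layer (fully connected),
    # passed through a nonlinearity (ReLU here)
    return np.maximum(prev_outputs @ weights + bias, 0.0)

x = rng.standard_normal(128)                 # outputs of the prior layer
W = 0.1 * rng.standard_normal((128, 64))     # illustrative weights
b = np.zeros(64)
h = mlp_layer(x, W, b)                       # outputs of the new layer
print(h.shape)                               # (64,)
```
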
FPGA architecture
● CLB (Configurable Logic Block): the basic cell of an FPGA. It consists of one 8-bit function generator, two 16-bit function generators, two registers (flip-flops or latches), and reprogrammable routing controls (multiplexers). CLBs are used to implement user-designed functions and macros. Each CLB has inputs on every side, which makes it flexible for the mapping and partitioning of logic.
● I/O Pads or Blocks: the input/output pads let outside peripherals access the functions of the FPGA, and through them the FPGA communicates with different peripherals for different applications.
● Switch Matrix / Interconnection Wires: the switch matrix connects the long and short interconnection wires in flexible combinations. It also contains the transistors that switch connections between different lines on and off.
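The function generators in a CLB are essentially look-up tables (LUTs). A hypothetical software model of one (Python; names and sizes are illustrative) shows the idea: the 2^k configuration bits define an arbitrary k-input boolean function.

```python
# A CLB function generator modeled as a look-up table (LUT): the
# 2**k configuration bits define any k-input boolean function.
def make_lut(truth_table):
    """truth_table: 2**k output bits, indexed by the packed k inputs."""
    def lut(*inputs):
        index = 0
        for bit in inputs:                 # pack inputs into a table index
            index = (index << 1) | (bit & 1)
        return truth_table[index]
    return lut

xor = make_lut([0, 1, 1, 0])               # configure a 2-input LUT as XOR
assert [xor(a, b) for a in (0, 1) for b in (0, 1)] == [0, 1, 1, 0]
```
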
TPU architecture - AI dedicated processing chip
TPU block diagram
● Instructions arrive over a PCIe Gen3 x16 bus
● MMU (Matrix Multiply Unit) - 256x256 8-bit integer multiply-add units
● Accumulators - 4 MiB = 4K x 256 x 32 b
● The matrix unit produces one 256-element partial sum per clock cycle
● Host interface (PCIe/DMA) functionality:
  ○ reads data from the CPU host memory into the Unified Buffer (UB)
  ○ reads weights from Weight Memory into the Weight FIFO as input to the Matrix Unit
  ○ orders the Matrix Unit to perform a matrix multiply or a convolution from the Unified Buffer into the Accumulators
● The Activation unit performs the nonlinear function of the artificial neuron, with options for ReLU and Sigmoid; it can also perform pooling operations
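A toy-scale sketch of this dataflow (NumPy, with N = 8 standing in for 256; illustrative, not the actual TPU pipeline): 8-bit integer products are summed into 32-bit accumulators, then the activation stage applies the nonlinearity.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 8                                      # toy stand-in for 256

data    = rng.integers(-128, 128, size=N,      dtype=np.int8)
weights = rng.integers(-128, 128, size=(N, N), dtype=np.int8)

# Matrix unit: 8-bit integer multiplies summed into 32-bit accumulators
partial_sums = data.astype(np.int32) @ weights.astype(np.int32)

# Activation unit: nonlinear function of the artificial neuron (ReLU here)
activated = np.maximum(partial_sums, 0)
print(activated)
```
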
Matrix operation
[Diagram: weights and data streaming through the matrix unit.]
● A given 256-element multiply-accumulate operation moves through the matrix as a diagonal wavefront.
● Weights are preloaded, and take effect with the advancing wave alongside the first data of a new block.
● Control and data are pipelined to give the illusion that the 256 inputs are read at once.
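The diagonal wavefront is easier to see in a cycle-by-cycle toy model. Below is a sketch under assumptions (a weight-stationary systolic array computing a vector-matrix product; the real matrix unit has far more machinery), not the actual TPU implementation.

```python
import numpy as np

def systolic_matvec(W, x):
    """Cycle-by-cycle toy of a weight-stationary systolic array computing
    y = x @ W. Weights stay in place; x[i] enters row i with a one-cycle
    skew, x values flow right, partial sums flow down, and each column's
    result leaves the bottom edge one cycle after its neighbor, i.e. the
    diagonal wavefront described above."""
    n, m = W.shape
    x_reg = np.zeros((n, m))        # x input sitting at each PE this cycle
    psum = np.zeros((n, m))         # partial-sum input at each PE this cycle
    y = np.zeros(m)
    for t in range(n + m - 1):
        if t < n:
            x_reg[t, 0] = x[t]      # skewed injection: row i is fed at cycle i
        out = psum + x_reg * W      # every PE does one multiply-add
        j = t - (n - 1)             # column whose result exits this cycle
        if 0 <= j < m:
            y[j] = out[n - 1, j]
        x_reg[:, 1:] = x_reg[:, :-1].copy()   # x advances one PE to the right
        x_reg[:, 0] = 0.0
        psum[1:, :] = out[:-1, :]             # partial sums advance downward
        psum[0, :] = 0.0
    return y

W = np.arange(12, dtype=float).reshape(3, 4)
x = np.array([1.0, 2.0, 3.0])
assert np.allclose(systolic_matvec(W, x), x @ W)
```
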
Memory subsystem architecture
Next technological step
AI Technology Trend
[Chart: number of semiconductor elements per process module (10^1 to 10^12) versus number of process modules per system. The trend runs from CPUs and GPUs through TPUs, NN/CNN optimized models, and analog processors toward distributed inference and cognitive computing; "Today" marks the current point.]
Analog memory matrix
● No chip size limitation
● Fixed NN graph
● Each weight represented by an analog memory cell
● Nonlinear functions of memorization and forgetting
[Diagram: analog memory cells with resistances R0 and R1 under training and forgetting; an NN graph with a graph anchor.]
Questions?
Skype: dovnmr
E-mail: dov.nimratz@globallogic.com
Thank You
