CPU / GPU / TPU architecture
Dov Nimratz
Senior Solution Architect
Table of contents
1. Historical context
2. CPU architecture
3. GPU architecture - Genie in a bottle for artificial intelligence
4. TPU architecture - AI dedicated processing chip
5. Next technological step

Historical context
Harvard architecture
● Originated with the Harvard Mark I, designed by Howard Aiken in the 1930s
● Separate storage and signal pathways for instructions and data
● Frequently used in DSPs

von Neumann architecture
SISD (Single Instruction, Single Data) architecture
1. The principle of duality (binary coding).
2. The principle of program control.
3. The principle of memory homogeneity: instructions and data share one memory.
4. The principle of memory addressability.
5. The principle of sequential program execution.
6. The principle of conditional jumps.
Storing instructions as ordinary data is what makes "programs that write programs" possible.
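That last point can be made concrete with a short sketch (Python here, an assumption of this write-up; the deck itself shows no code): because instructions and data live in the same memory, a program can build new code as data and then execute it.

```python
# Hypothetical illustration: code is held in memory as ordinary data,
# so a program can write and then run another program.
source = "def square(x):\n    return x * x\n"   # code stored as data
namespace = {}
exec(source, namespace)                          # turn the data into code
print(namespace["square"](7))                    # -> 49
```
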
Flynn's data / instruction classification model
● Single instruction stream, single data stream (SISD) - traditional uniprocessor machines, like older personal computers.
● Single instruction stream, multiple data streams (SIMD) - the most common style of parallel programming.
● Multiple instruction streams, single data stream (MISD) - an uncommon architecture, generally used for fault tolerance.
● Multiple instruction streams, multiple data streams (MIMD) - distributed and multi-core systems.
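A rough sketch of the SISD/SIMD distinction, assuming NumPy is available (illustrative, not from the original slides): the scalar loop touches one data element per step, while the vectorized call applies one instruction across the whole array, which NumPy dispatches to SIMD-style machine code internally.

```python
import numpy as np

data = np.arange(10_000, dtype=np.float32)

# SISD style: a single instruction stream handles one data element at a time.
def scale_sisd(xs, factor):
    out = np.empty_like(xs)
    for i in range(len(xs)):          # one element per step
        out[i] = xs[i] * factor
    return out

# SIMD style: one instruction applied to many data elements at once.
def scale_simd(xs, factor):
    return xs * factor

assert np.allclose(scale_sisd(data, 2.0), scale_simd(data, 2.0))
```
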
CPU architecture
CISC / RISC / MISC / VLIW CPUs

CISC                                          | RISC
----------------------------------------------|--------------------------------------------
Emphasis on hardware                          | Emphasis on software
Multiple instruction sizes and formats        | Uniform-size instructions with few formats
Fewer registers                               | More registers
Many addressing modes                         | Few addressing modes
Extensive use of microprogramming             | Complexity pushed into the compiler
Instructions take a variable number of cycles | Instructions take one cycle
Pipelining is difficult                       | Pipelining is easy

CPU architecture
● The main task of the CPU is to execute a chain of instructions in the shortest possible time.
● The CPU may execute several chains at the same time, executing them separately and then merging them back into one in the correct order.
● Each instruction in the stream may depend on the instructions preceding it.

ALU architecture
GPU architecture - Genie in a bottle for artificial intelligence
Task vs data parallelism
Task parallel:
– Independent processes with little communication
– Easy to use
Data parallel:
– Lots of data on which the same computation is executed
– No dependencies between data elements in each computation step
– Can saturate many ALUs
– But often requires redesign of traditional algorithms
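A minimal sketch of the two styles in Python (job names and workloads are hypothetical): the independent jobs run as separate processes with no communication, while the data-parallel step applies one computation to every element with no inter-element dependencies.

```python
import numpy as np
from concurrent.futures import ProcessPoolExecutor

# Task parallelism: independent jobs, little communication between them.
def run_job(job_id):                      # hypothetical stand-alone task
    return f"job-{job_id} finished"

# Data parallelism: the same computation over every data element,
# with no dependencies between elements in a computation step.
def brighten(image):
    return np.clip(image * 1.2, 0.0, 1.0)

if __name__ == "__main__":
    with ProcessPoolExecutor() as pool:            # task parallel
        print(list(pool.map(run_job, range(4))))
    frame = np.random.rand(480, 640).astype(np.float32)
    print(brighten(frame).mean())                  # data parallel
```
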
CPU vs GPU (GPGPU)
CPU
– Really fast caches (great for data reuse)
– Fine branching granularity
– Lots of different processes/threads
– High performance on a single thread of execution
GPU
– Lots of math units
– Fast access to onboard memory
– Runs a program on each fragment/vertex
– High throughput on parallel tasks
● CPUs are great for task parallelism
● GPUs are great for data parallelism

Ideal apps to target GPGPU
– Large data sets
– High parallelism
– Minimal dependencies between data elements
– High arithmetic intensity
– Lots of work to do without CPU intervention

Graphics pipeline in a GPU
Kernels
– Functions applied to each element in a stream
  • transforms, PDEs, ...
– No dependencies between stream elements
  • Encourages high arithmetic intensity
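A stream kernel in this sense is just a pure, element-independent function. A small NumPy sketch (illustrative; the function itself is hypothetical):

```python
import numpy as np

# A stream "kernel": a pure function applied independently to every
# element, so all elements could be processed in parallel.
def kernel(x):
    # several arithmetic operations per element -> high arithmetic intensity
    return np.sin(x) * np.cos(x) + 0.5 * x * x

stream = np.linspace(0.0, 1.0, 10_000, dtype=np.float32)
result = kernel(stream)      # no dependencies between stream elements
```
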
GPGPU block diagram
● SIMD (single instruction, multiple data)
● 8-16 stream cores in each processor
● PE (processing element) / ALU

Patterns
Deep Learning Neural Networks
Three kinds of NNs are popular today:
1. Multi-Layer Perceptrons (MLP): each new layer is a set of nonlinear functions of weighted sums of all outputs (fully connected) from the prior layer.
2. Convolutional Neural Networks (CNN): each ensuing layer is a set of nonlinear functions of weighted sums of spatially nearby subsets of outputs from the prior layer, which reuses the weights.
3. Recurrent Neural Networks (RNN): each subsequent layer is a collection of nonlinear functions of weighted sums of outputs and the previous state. The most popular RNN is Long Short-Term Memory (LSTM). The art of the LSTM is in deciding what to forget and what to pass on as state to the next layer. The weights are reused across time steps.
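A minimal sketch of one MLP layer as described above (NumPy; the sizes and names are illustrative, not from the deck): a weighted sum over all prior-layer outputs, followed by a nonlinearity.

```python
import numpy as np

rng = np.random.default_rng(0)

def mlp_layer(prev_outputs, weights, bias):
    # weighted sums of all outputs from the prior layer (fully connected),
    # passed through a nonlinearity (ReLU here)
    return np.maximum(prev_outputs @ weights + bias, 0.0)

x = rng.standard_normal(128)                 # outputs of the prior layer
W = 0.1 * rng.standard_normal((128, 64))     # illustrative weights
b = np.zeros(64)
h = mlp_layer(x, W, b)                       # outputs of the new layer
print(h.shape)                               # (64,)
```
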
FPGA architecture
● CLB (Configurable Logic Block): the basic cell of an FPGA. It consists of one 8-bit function generator, two 16-bit function generators, two registers (flip-flops or latches), and reprogrammable routing controls (multiplexers). CLBs are used to implement user-designed functions and macros. Each CLB has inputs on every side, which makes it flexible for the mapping and partitioning of logic.
● I/O Pads or Blocks: the input/output pads let outside peripherals access the functions of the FPGA, and through them the FPGA communicates with different peripherals for different applications.
● Switch Matrix / Interconnection Wires: the switch matrix connects the long and short interconnection wires in flexible combinations. It also contains the transistors that switch connections between different lines on and off.
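The function generators in a CLB are essentially look-up tables (LUTs). A hypothetical software model of one (Python; names and sizes are illustrative) shows the idea: the 2^k configuration bits define an arbitrary k-input boolean function.

```python
# A CLB function generator modeled as a look-up table (LUT): the
# 2**k configuration bits define any k-input boolean function.
def make_lut(truth_table):
    """truth_table: 2**k output bits, indexed by the packed k inputs."""
    def lut(*inputs):
        index = 0
        for bit in inputs:                 # pack inputs into a table index
            index = (index << 1) | (bit & 1)
        return truth_table[index]
    return lut

xor = make_lut([0, 1, 1, 0])               # configure a 2-input LUT as XOR
assert [xor(a, b) for a in (0, 1) for b in (0, 1)] == [0, 1, 1, 0]
```
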
TPU architecture - AI dedicated processing chip
TPU block diagram
● Instructions arrive over a PCIe Gen3 x16 bus
● MMU (Matrix Multiply Unit) - 256x256 8-bit integer multiply-add units
● Accumulators - 4 MiB = 4K x 256 x 32 b
● The matrix unit produces one 256-element partial sum per clock cycle
● Host interface (PCIe/DMA) functionality:
  ○ reads data from the CPU host memory into the Unified Buffer (UB)
  ○ reads weights from Weight Memory into the Weight FIFO as input to the Matrix Unit
  ○ orders the Matrix Unit to perform a matrix multiply or a convolution from the Unified Buffer into the Accumulators
● The Activation unit performs the nonlinear function of the artificial neuron, with options for ReLU and Sigmoid; it can also perform pooling operations
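A toy-scale sketch of this dataflow (NumPy, with N = 8 standing in for 256; illustrative, not the actual TPU pipeline): 8-bit integer products are summed into 32-bit accumulators, then the activation stage applies the nonlinearity.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 8                                      # toy stand-in for 256

data    = rng.integers(-128, 128, size=N,      dtype=np.int8)
weights = rng.integers(-128, 128, size=(N, N), dtype=np.int8)

# Matrix unit: 8-bit integer multiplies summed into 32-bit accumulators
partial_sums = data.astype(np.int32) @ weights.astype(np.int32)

# Activation unit: nonlinear function of the artificial neuron (ReLU here)
activated = np.maximum(partial_sums, 0)
print(activated)
```
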
Matrix operation
[Diagram: weights and data streaming through the matrix unit.]
● A given 256-element multiply-accumulate operation moves through the matrix as a diagonal wavefront.
● Weights are preloaded, and take effect with the advancing wave alongside the first data of a new block.
● Control and data are pipelined to give the illusion that the 256 inputs are read at once.
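The diagonal wavefront is easier to see in a cycle-by-cycle toy model. Below is a sketch under assumptions (a weight-stationary systolic array computing a vector-matrix product; the real matrix unit has far more machinery), not the actual TPU implementation.

```python
import numpy as np

def systolic_matvec(W, x):
    """Cycle-by-cycle toy of a weight-stationary systolic array computing
    y = x @ W. Weights stay in place; x[i] enters row i with a one-cycle
    skew, x values flow right, partial sums flow down, and each column's
    result leaves the bottom edge one cycle after its neighbor, i.e. the
    diagonal wavefront described above."""
    n, m = W.shape
    x_reg = np.zeros((n, m))        # x input sitting at each PE this cycle
    psum = np.zeros((n, m))         # partial-sum input at each PE this cycle
    y = np.zeros(m)
    for t in range(n + m - 1):
        if t < n:
            x_reg[t, 0] = x[t]      # skewed injection: row i is fed at cycle i
        out = psum + x_reg * W      # every PE does one multiply-add
        j = t - (n - 1)             # column whose result exits this cycle
        if 0 <= j < m:
            y[j] = out[n - 1, j]
        x_reg[:, 1:] = x_reg[:, :-1].copy()   # x advances one PE to the right
        x_reg[:, 0] = 0.0
        psum[1:, :] = out[:-1, :]             # partial sums advance downward
        psum[0, :] = 0.0
    return y

W = np.arange(12, dtype=float).reshape(3, 4)
x = np.array([1.0, 2.0, 3.0])
assert np.allclose(systolic_matvec(W, x), x @ W)
```
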
Memory subsystem architecture
Next technological step
AI Technology Trend
[Chart: number of semiconductor elements per process module (10^1 to 10^12) versus number of process modules per system. The trend runs from CPUs and GPUs through TPUs, NN/CNN optimized models, and analog processors toward distributed inference and cognitive computing; "Today" marks the current point.]
Analog memory matrix
● No chip size limitation
● Fixed NN graph
● Each weight represented by an analog memory cell
● Nonlinear functions of memorization and forgetting
[Diagram: analog memory cells with resistances R0 and R1 under training and forgetting; an NN graph with a graph anchor.]
Questions?
Skype: dovnmr
E-mail: dov.nimratz@globallogic.com
Thank You
