Understanding	why	AI	will	become	the	most	prevalent	server	
workload	by	2020
Rob	Farber	
CEO	TechEnablement.com	
Contact	info	at	techenablement.com for	consulting,	teaching,	writing,	and	other	inquiries
Machine	learning	has	redefined	the	market
“In	the	near	future	every	piece	of	data	in	the	data	center	will	be	
interacted	with	by	AI”	– Ian	Buck	(VP	Accelerated	Computing,	NVIDIA)
“By	2020	servers	will	run	data	analytics	more	than	any	other	
workload”	– Diane	Bryant	(VP	and	GM	of	the	Data	Center	Group,	
Intel)
Why?	“Computational	Universality”	via	training!
• The	famous	XOR	problem	nicely	emphasizes	the	
importance	of	hidden	neurons	
• Networks	with	hidden	units	can	implement	all	
Boolean	functions	used	to	build	a	computer
Computational	Universal	Machine	
Learning!
• Networks without nonlinear hidden units cannot learn XOR and hence are not computationally universal
• Cannot	represent	large	classes	of	problems
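To make the XOR point concrete: a network with two hidden units and threshold activations computes XOR exactly. This is a minimal hand-wired sketch (weights chosen by hand rather than trained) showing why hidden units matter:

```python
import numpy as np

def step(x):
    # Heaviside threshold activation
    return (x > 0).astype(int)

# Hidden layer: unit 1 computes OR, unit 2 computes NAND
W_hidden = np.array([[1.0, 1.0],     # OR weights
                     [-1.0, -1.0]])  # NAND weights
b_hidden = np.array([-0.5, 1.5])

# Output unit computes AND of the two hidden units -> XOR
w_out = np.array([1.0, 1.0])
b_out = -1.5

def xor_net(x):
    h = step(W_hidden @ x + b_hidden)
    return step(w_out @ h + b_out)

for x in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    # (0,0)->0, (0,1)->1, (1,0)->1, (1,1)->0
    print(x, int(xor_net(np.array(x, dtype=float))))
```

Removing the hidden layer leaves a single linear threshold unit, which cannot separate XOR's classes — the heart of the universality argument.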
NetTalk
Sejnowski,	T.	J.	and	Rosenberg,	C.	R.	(1986)	NETtalk:	a	parallel	
network	that	learns	to	read	aloud,	Cognitive	Science,	14,	179-211	
http://en.wikipedia.org/wiki/NETtalk_(artificial_neural_network)
500	learning	loops Finished
"Applications	of	Neural	Net	
and	Other	Machine	Learning	
Algorithms	to	DNA	Sequence	
Analysis",	(1989).
“How Neural Networks Work", (Lapedes, Farber, 1987).
Deep-Learning	(learn	from	data	many	of	the	
things	we	do)
Speech	recognition	in	noisy	
environments	(Siri,	Cortana,	
Google,	Baidu,	…)
Better	than	human	
accuracy	face	
recognition
Self-driving	cars
• Internet Search • Robotics • Self-guiding drones • Much, much more
Speech	recognition	is	a	Bellwether
A	driving	force	
for	ubiquitous	
inferencing	in	
the	data	center
Expect	amazing	growth	“$10T	incremental	value”	
with	1000x	increase	in	data	volume
• CEO	Saudi	Telecom	statement	during	his	KAUST	Global	IT	Keynote
• “We	expect	5G	to	increase	the	volume	of	mobile	data	by	1,000x”
• $10T	incremental	value
Khalid	Bin	Hussein	Bayari
CEO,	Saudi	Telecom
See	also:	http://www.mwc.gr/presentations/2016/kolokotronis.pdf and	https://www.itu.int/en/ITU-T/Workshops-
and-Seminars/standardization/201603/Documents/Abstracts-Presentations/S2P3_Ali_Amer.pptx
Source:	METIS
From	NetTalk to	Bioinformatics
Internal	
connections
The	phoneme	to	
be	pronounced
NetTalk
Sejnowski,	T.	J.	and	Rosenberg,	C.	R.	(1986)	
NETtalk:	a	parallel	network	that	learns	to	read	
aloud,	Cognitive	Science,	14,	179-211	
http://en.wikipedia.org/wiki/NETtalk_(artificial_neural
_network)
Internal	
connections
Input: sliding window of DNA bases (… A T C G T …)
"Applications	of	Neural	Net	and	Other	Machine	
Learning	Algorithms	to	DNA	Sequence	Analysis",	
A.S.	Lapedes,	C.	Barnes,	C.	Burks,	R.M.	Farber,	K.	
Sirotkin,	Computers	and	DNA,	SFI	Studies	in	the	
Sciences	of	Complexity,	vol.	VII,	Eds.	G.	Bell	and	
T.	Marr,	Addison-Wesley,	(1989).
T|F	Exon	region
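The sliding-window setup above can be sketched in a few lines: each DNA base in the window is one-hot encoded into the network's input vector, and the network outputs T|F for "exon region". This is an illustrative encoding; the 1989 paper's exact scheme may differ:

```python
import numpy as np

BASES = "ACGT"

def encode_window(seq):
    """One-hot encode a DNA window, one 4-element group per base
    (an illustrative encoding, not necessarily the paper's)."""
    vec = np.zeros(4 * len(seq))
    for i, base in enumerate(seq):
        vec[4 * i + BASES.index(base)] = 1.0
    return vec

x = encode_window("ATCGT")
print(x.shape)   # (20,)
print(x[:4])     # 'A' -> [1. 0. 0. 0.]
```

The encoded vector is what feeds the "internal connections" in the diagram; the same windowing idea carries directly over from NetTalk's letter windows.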
From	Bioinformatics	to	drug	design
(The	closer	you	look	the	greater	the	complexity)
Electron	Microscope
We	formed	a	company,	then	“The	Question”
How	do	we	know	you	
are	not	playing	
expensive	computer	
games	with	our	money?
Train	then	utilize	a	blind	test
Internal	
connections
A0
Binding	
affinity	for	a	
specific	
antibody
A1 A2 A3 A4 A5
Possible	hexamers	
20⁶ = 64M
1k – 2k pseudo-random
(hexamer, binding affinity) pairs
Approx.	0.001%	
sampling
“Learning Affinity Landscapes: Prediction of Novel Peptides”, Alan
Lapedes and Robert Farber, Los Alamos National Laboratory
Technical Report LA-UR-94-4391 (1994).
Hill	climbing	to	find	high	affinity
Internal	
connections
A0
Affinity(hexamer)
A1 A2 A3 A4 A5
Learn:
Affinity(hexamer) = f(A0, …, A5)
𝑓(F,F,F,F,F,F)
𝑓(F,F,F,F,F,L)
𝑓(F,F,F,F,F,V) 𝑓(F,F,F,F,L,L)
𝑓(P,C,T,N,S,L)
Predict	P,C,T,N,S,L	has	the	
highest	binding	affinity
Confirm	
experimentally
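The hill climb above can be sketched as greedy single-residue substitution over the learned affinity function. Here `predicted_affinity` is a toy stand-in for the trained network (the real f was learned from the ~1k–2k measured pairs), and the target hexamer is only for illustration:

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues

# Toy stand-in for the trained network's predicted affinity.
TARGET = "PCTNSL"
def predicted_affinity(hexamer):
    return sum(a == b for a, b in zip(hexamer, TARGET))

def hill_climb(f, start, max_scans=100):
    """Greedy single-residue substitutions until no neighbor improves."""
    current = list(start)
    for _ in range(max_scans):
        best, best_score = None, f("".join(current))
        for pos in range(6):
            for aa in AMINO_ACIDS:
                cand = current[:]
                cand[pos] = aa
                s = f("".join(cand))
                if s > best_score:
                    best, best_score = cand, s
        if best is None:        # local optimum reached
            break
        current = best
    return "".join(current)

print(hill_climb(predicted_affinity, "FFFFFF"))  # converges to PCTNSL
```

The key economics: each "experiment" here is a cheap network evaluation, so the 64M-point landscape can be searched without synthesizing 64M peptides.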
Two	important	points
• The	computer	appears	to	correctly	predict	experimental	data
• Demonstrated	that	complex	binding	affinity	relationships	can	be	
learned	from	a	small	set	of	samples
• Necessary	because	it	is	only	possible	to	sample	a	very	small	subset	of	the	
binding	affinity	landscape	for	drug	candidates
Time	series
Iterate
Xt+1 = f(Xt, Xt-1, Xt-2, …)
Xt+2 = f(Xt+1, Xt, Xt-1, …)
Xt+3 = f(Xt+2, Xt+1, Xt, …)
Xt+4 = f(Xt+3, Xt+2, Xt+1, …)
Internal	
connections
Xt Xt-1
Learn:
Xt+1 = f(Xt, Xt-1, Xt-2, …)
Xt-2 Xt-3 Xt-4 Xt-5
Xt+1
Works	great!	(better	than	other	
methods	at	that	time)
"How	Neural	Nets	Work",	A.S.	Lapedes,	R.M.	Farber,	
reprinted	in	Evolution,	Learning,	Cognition,	and	Advanced	
Architectures,	World	Scientific	Publishing.	Co.,	(1987).
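The closed-loop iteration can be sketched as follows. Here `f` is the true logistic map standing in for the learned network, so only the feedback scheme itself is illustrated — each prediction is pushed back onto the input window:

```python
from collections import deque

# Stand-in for the learned network f(Xt, Xt-1, ...): the logistic map.
def f(history):
    x_t = history[0]          # newest value first
    return 3.5 * x_t * (1.0 - x_t)

def iterate_forecast(history, steps):
    """Closed-loop forecasting: each prediction is fed back as input."""
    h = deque(history, maxlen=len(history))
    preds = []
    for _ in range(steps):
        x_next = f(h)
        preds.append(x_next)
        h.appendleft(x_next)  # newest value first, matching f's convention
    return preds

print(iterate_forecast([0.5, 0.49, 0.48], steps=4))
```

This is exactly the "Iterate" scheme on the slide: one-step-ahead predictions chained into multi-step forecasts.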
‘Sliding	inference’	during	training	to	increase	
accuracy
Xt Xt-1
Internal
connections
Xt-2 Xt-3 Xt-4 Xt-5 Xt+1 Xt+2 Xt+3
Pt+3
Error(example) = Σi=1..3 (Xt+i − Pt+i)²
Pt+1
Pt+2
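A minimal sketch of that multi-step error: predict t+1 through t+k closed-loop, comparing each prediction Pt+i against the observed Xt+i. The toy `f` below is a stand-in for the ANN:

```python
def sliding_inference_error(f, history, targets):
    """Multi-step training error: predict t+1..t+k closed-loop and
    sum squared errors against the observed X values."""
    h = list(history)           # newest first: [Xt, Xt-1, ...]
    error = 0.0
    for x_true in targets:      # [Xt+1, Xt+2, Xt+3]
        p = f(h)
        error += (x_true - p) ** 2
        h = [p] + h[:-1]        # feed the prediction back in
    return error

# toy f: predict the mean of the history window (stand-in for the ANN)
f = lambda h: sum(h) / len(h)
print(sliding_inference_error(f, [1.0, 1.0, 1.0], [1.0, 1.0, 1.0]))  # 0.0
```

Training against this summed error forces the network to stay accurate when its own predictions are recycled as inputs, which is what single-step training does not guarantee.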
Designing	ANNs	for	Integration	and	
Bifurcation	analysis	– “training	a	netlet”
"Identification of Continuous-Time Dynamical Systems:
Neural Network Based Algorithms and Parallel
Implementation", R. M. Farber, A. S. Lapedes, R.
Rico-Martinez and I. G. Kevrekidis, Proceedings of the 6th
SIAM Conference on Parallel Processing for Scientific
Computing, Norfolk, Virginia, March 1993.
ANN	schematic	for	continuous-time	
identification.	(a)	A	four-layered	ANN	based	on	
a	fourth	order	Runge-Kutta integrator.	(b)	ANN	
embedded	in	a	simple	implicit	integrator.
(a)	Periodic	attractor	of	the	Van	der	Pol	oscillator	for	g=	1.0,	
d=	4.0	and		w =	1.0.	The	unstable	steady	state	in	the	interior	
of	the	curve	is	marked	+.	(b)	ANN-based	predictions	for	the	
attractors	of	the	Van	der	Pol	oscillator	shown	in	(a).
Dimension	reduction
• The	curse	of	dimensionality
• People	cannot	visualize	data	beyond	3D	+	color
• Search	volume	rapidly	increases	with	dimension
• Queries	return	too	much	data	or	no	data
[Diagram: many sensor input channels (Sensor 1 … Sensor N) feed networks through a bottleneck unit (B); the bottleneck activations provide a reduced (X, Y, Z) representation suitable for visualization]
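A sketch of the bottleneck idea: N sensor channels are squeezed through a 3-unit code that can be plotted as (X, Y, Z). The weights here are random and untrained, shown only for the structure; training would fit the encoder/decoder to minimize reconstruction error:

```python
import numpy as np

rng = np.random.default_rng(0)
n_sensors, bottleneck = 20, 3   # compress 20 sensor channels to (X, Y, Z)

# Untrained linear autoencoder, shown only for its bottleneck structure.
W_enc = rng.normal(size=(bottleneck, n_sensors))
W_dec = rng.normal(size=(n_sensors, bottleneck))

readings = rng.normal(size=n_sensors)   # one vector of sensor readings
xyz = W_enc @ readings                  # 3-D code: the plottable X, Y, Z
reconstruction = W_dec @ xyz            # back to sensor space

print(xyz.shape, reconstruction.shape)  # (3,) (20,)
```

The 3-D code sidesteps both curse-of-dimensionality symptoms on the slide: humans can visualize it, and search volumes stay small.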
A	general	SIMD	mapping:	
Optimize(LMS_Error =	objFunc(p1,	p2,	…	pn))
Examples
0, N-1
Examples
N, 2N-1
Examples
2N, 3N-1
Examples
3N, 4N-1
Step 2
Calculate partials
Step1
Broadcast
parameters
Optimization Method
(Powell, Conjugate Gradient, Other)
Step 3
Sum partials to get
energy
GPU 1 GPU 2 GPU 3
p1,p2, … pn p1,p2, … pn p1,p2, … pn p1,p2, … pn
GPU 4
Host
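The broadcast/partial/sum pattern in the figure can be sketched with plain Python chunks standing in for the GPUs: the host broadcasts the parameters, each device computes a partial LMS error over its slice of the examples, and the partials are summed into the energy handed back to the optimizer:

```python
import numpy as np

def partial_lms(params, examples):
    """Per-device step 2: partial least-mean-squares error over one chunk."""
    w, b = params
    x, y = examples
    pred = w * x + b
    return float(np.sum((pred - y) ** 2))

# toy dataset y = 2x + 1, split across 4 "devices" (stand-ins for GPUs)
x = np.arange(8.0)
y = 2.0 * x + 1.0
chunks = [(x[i::4], y[i::4]) for i in range(4)]

def objective(params):
    # step 1: broadcast params; step 2: partials; step 3: sum to energy
    return sum(partial_lms(params, c) for c in chunks)

print(objective((2.0, 1.0)))       # exact parameters -> zero energy
print(objective((0.0, 0.0)) > 0)   # wrong parameters -> positive energy
```

Because each chunk's partial is independent, the map step scales to however many devices hold the examples — the SIMD mapping the slide describes.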
[Plot: TACC Stampede PCA scaling — Average Sustained TF/s (0–2500) vs. number of Intel Xeon Phi coprocessors/Sandy Bridge nodes (0–3500)]
Many	problems	are	too	big	for	a	single	computer	–
Strong	scaling	execution	model!
Perfect	strong	scaling	decreases	runtime	linearly	by	
the	number	of	processing	elements	
• O(log N) scaling is "good enough"
See	a	path	to	exascale
(MPI	can	map	to	thousands	of	GPU	or	Processor	nodes)
Always	report	“Honest	Flops”
Expect	significant	algorithm	and	HW	retooling
Today:	“only	7%	of	all	servers	being	used	for machine	learning	and	only	0.1%	are	
running	deep	neural	nets”	– Forbes
By	2020,	100%	of	servers	running	machine	learning	– Intel,	NVIDIA,	Wall	Street
Roughly four different camps: CPU, GPU, FPGA, and custom chips
NVIDIA	(GPUs)
• Restarted	massive	parallelism	with	CUDA	and	GPU	computing
• Making	big	inroads	into	the	data	center
GPU threads are grouped into thread blocks
• Threads communicate only within a thread block
• (yes, there are also global atomic ops)
• Fast	hardware	scheduling
• Blocks run when their dependencies are resolved
• Blocks	that	are	ready	to	run	get	assigned	to	processing	elements
• Fast	hardware	scheduling
Scalability	required	to	use	all	those	cores	
(strong	scaling	execution	model)
Active	Queue
Executables	can	run	unchanged on	bigger	
GPUs
• Dealer	Analogy
Scheduler SMX
Strong	Scaling	
Execution	Model
NVIDIA	Claims	Big	Perf.	Increases	since	2013
NVIDIA	on	speech	recognition
NVIDIA	Claims	for	P40	using	INT8	math	&	TensorRT
Processor-based	computing
AMD
Intel
Traditional	Vector	ISA
[Illustration: traditional design — each core paired with a wide SSE unit — vs. each core paired with a 512-bit-wide vector unit]
Floating-point	performance	comes	from	the	dual	
per	core	vector	units
• AVX-512	=	16	32-bit	ops/clock
[Chart: Performance vs. Parallelism — scalar single-threaded code sits at "No Parallelism", vector single-threaded code above it, and vector plus massively parallel code reaches the highest performance. Image courtesy Elsevier]
Convergence	(for	training	and	HPC	in	general)
• NVIDIA	Pascal	has	a	working	MMU	(Memory	Management	Unit)
• Data can automatically be moved between CPU and GPU on a demand basis.
• Offload programming is no longer a requirement, it's an optimization!
• This is a really big deal, as code changes are a barrier to GPU adoption
• Pascal	GPUs	have	fast	stacked	memory	(Much	faster,	more	capacity,	energy	efficient)	
• and	NVlink (fast	host/GPU	memory	bandwidth	transfer	– but	only	with	IBM!)
• Intel	Xeon	Phi	(formerly	known	as	Knights	Landing):
• Data	can	be	automatically	moved	between	near	(stacked)	and	far	(DDR4)	memory	on	a	
demand	basis	using	cache	mode.
• Offload programming is no longer a requirement, it's an optimization!
• Stacked	memory	is	much	faster,	more	capacity,	energy	efficient
• IBM
• Data	can	be	automatically	moved	between	Power	and	GPU	on	a	demand	basis	using	
NVlink.
• Offload programming is no longer a requirement, it's an optimization!
IBM	approach
• Sumit Gupta	(VP,	High	Performance	Computing	and	Data	Analytics,	
IBM),	“fundamentally,	accelerators	are	the	path	forward.”	
• These	accelerators	are	GPUs	for	compute,	storage	accelerators	for	big	data	
and	FPGAs	for	special	functions.
• Watson	(of	Jeopardy	fame)	for	software
• TrueNorth
• Developed	as	part	of	the	DARPA	SyNAPSE program
• 46	Billion	synaptic	op/s	using	70	mW!
Sumit Gupta
The	OpenPower ‘special	sauce’
1. CAPI	(IBM	Coherent	Accelerator	Processor	Interface)
2. NVlink used in the CORAL "Summit" and "Sierra" supercomputers
• Make	application	acceleration:
• Much	easier	
• Transparent	for	the	application	programmer.	
• ‘Open’	for	all	to	join
CAPI	is	important
• Supports	compute,	storage,	and	
special	(e.g.	FPGA)	accelerators
• Shares	virtual	addressing	–
everything	works	with	same	
memory	addresses
• Provides	hardware	managed	
cache	coherence
• Claims	to	eliminate	97%	of	code	
path	length!
Data	handling	can	take	as	much	time	as	the	
computational	problem!
• ORNL	Titan
– 112,128	GB	of	GPU	memory	in	18,688	K20x	GPUs
• Data	handling	must	be
– Language	agnostic
– Scalable
OpenPower storage	accelerators
• 56	Terabyte	‘extended’	memory	on	Power8	using	flash
Databases	are	important	to	processing	data	for	training	sets	and	more
NoSQL	(Not	Only	SQL)	Databases
Server	throughput	in	Ops/Sec	(50/50	read/write	ratio)	Image	courtesy	IBM
SPEC	M3	Benchmark	on	68	TB	CAPI			system
• Fastest	mean	response	times	and	most	consistent	response	times	(lowest	
standard	deviation)	ever	reported,	for	all	combinations	of	query	type,	data	
volume,	and	concurrent	users.
• Each	mean	response	time	was	5.5x	to	212x	the	previous	best	result,	
including:
• 21x	to	212x	the	performance	of	the	previous	best	published	result	for	the	market	
snap	benchmarks	(10T.YR[n]-MKTSNAP.TIME)	*
• 21x	the	performance	of	the	previous	best	published	result	for	year	high	bid	in	the	
smallest	year	of	the	dataset	(1T.OLDYRHIBID.TIME)	*
• 8-10x	the	performance	of	the	previous	best	published	result	for	the	100-user	
volume-weighted	average	bid	benchmarks	(100T.YR[n]VWAB-12D-HO.TIME)	**
• 5-8x	the	performance	of	the	previous	best	published	result	for	the	N-year	high-bid	
benchmarks	(1T.[n]YRHIBID.TIME)
Fast	scalable	data	loads	for	training	via	
parallel	file	systems
Node	
1
Node	
2
Node	
3
Node	
4
Node	
5
Node	
6
Node	
7
Node	
500
Each	MPI	client	on	
each	node:
1. Opens	file
2. Seeks	to	location
3. Reads	data
4. Close
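Each client's open/seek/read/close sequence is ordinary file I/O against the parallel file system; a minimal per-rank sketch (the file name and record size below are made up for the demo):

```python
import os
import tempfile

def read_my_slice(path, rank, nbytes):
    """One MPI client's data load: open, seek to its offset, read, close."""
    with open(path, "rb") as f:        # 1. open
        f.seek(rank * nbytes)          # 2. seek to this rank's slice
        return f.read(nbytes)          # 3. read (4. close via 'with')

# demo: a file holding 4 ranks' worth of 8-byte records
path = os.path.join(tempfile.mkdtemp(), "training.bin")
with open(path, "wb") as f:
    f.write(b"".join(bytes([r]) * 8 for r in range(4)))

print(read_my_slice(path, rank=2, nbytes=8))  # rank 2's 8-byte record
```

Because every rank seeks to a disjoint offset, the 500 nodes in the figure can load their training slices concurrently with no coordination beyond knowing their own rank.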
Other	training	and	inference	solutions
FPGAs
• Fast,	power	efficient	and	perfect	for	variable	precision	arithmetic
• Accessible	via	CAPI
• Very	difficult	to	program
Custom	chips
• Fast,	power	efficient	and	perfect	for	variable	precision	arithmetic
Heatsink	City
• A is	quad	Google	TPU2	motherboard	side	
view
• B is	dual	IBM	Power9	“Zaius”	motherboard
• C is	dual	IBM	Power8	“Minsky”	
motherboard
• D is	Dual	Intel	Xeon	Facebook	“Yosemite”	
motherboard
• E is Nvidia P100 SXM2 module with heat sink and Facebook "Big Basin" motherboard
(Image	courtesy	The	Next	Platform)
Nervana and	many	other	offerings
Google	TPU2:	
• 15x-30x	faster	than	CPU	and	GPU,	“On	
our	production	AI	workloads	that	utilize	
neural	network	inference”
• 30x	to	80x	improvement	in	TOPS/Watt	
measure
• Exclusive	to	Google	Cloud
IBM	TrueNorth
(A	path	to	the	future???)
• PNAS (8/9/16): "Convolutional networks for fast, energy-efficient neuromorphic computing", Steven K. Esser et al.
• Chip	implements	networks	using	integrate-and-fire	spiking	neurons
• IBM	researchers	ran	the	datasets	at	between	1,200	and	2,600	frames/s	and	
using	between	25	and	275	mW (effectively	>6,000	frames/s	per	watt)
• Can	go	really	big
A	system	roughly	the	neuronal	size	of	a	rat	brain
TrueNorth:	Accuracy
• PNAS	paper:	“[We]	demonstrate	that	neuromorphic	
computing	…	can	implement	deep	convolution	networks	
that	approach	state-of-the-art	classification	accuracy
across	eight	standard	datasets	encompassing	vision	and	
speech,	perform	inference	while	preserving	the	hardware’s	
underlying	energy-efficiency	and	high	throughput.”
Accuracy of different sized networks running on
one or more TrueNorth chips to perform
inference on eight datasets. For comparison,
accuracy of state-of-the-art unconstrained
approaches are shown as bold horizontal lines
(hardware resources used for these networks are
not indicated).
Common	software
• Theano:	A	Python	library	that	generates	C-code	for	a	CPU	or	GPU
• TensorFlow:	Google’s	open	source	library	for	machine	learning
• cuDNN:	NVIDIA’s	machine	learning	library
• Intel	DAAL	(Data	Analytics	Library)
• Torch:	An	open	source	middleware	library
• Caffe:	Berkeley’s	popular	framework
Speaking	of	accuracy	and	software	…
• Symbolically	calculated	gradients	(Jacobians)	are	important	as	they	greatly	
assist	the	search	for	good	solutions.	Think	L-BFGS,	Conjugate	Gradient,	…
• Use	of	a	gradient	provides	an	algorithmic	speedup	that	can	achieve	orders	of	
magnitude	faster	time-to-model	as	well	as	better	solutions.
• Getting	the	gradient	is	pretty	easy	with	popular	software	packages	such	as	Theano.
• Big	memory	is	required	to	perform	the	gradient	calculation
• The	size	of	the	gradient	gets	very	large,	very	fast	as	the	number	of	parameters	in	the	
ANN	model	increases.	
• Memory	capacity	and	bandwidth	limitations	(plus	cache	and	potentially	atomic	
instruction	performance)	dominate	the	runtime	of	the	gradient	calculation.
• The	size	of	the	code	can	exceed	GPU	instruction	memory	capacity.
• Definitely	a	place	for	big	memory	many-core	processors!
• Like	Power	and	Nvlink as	data	is	shared	between	all	devices	efficiently
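Why the gradient matters so much: with the analytic dE/dw of a least-squares objective, the search can converge in a handful of steps instead of blind probing. A minimal one-parameter sketch (real ANN Jacobians are the many-parameter version of the same idea):

```python
import numpy as np

# Least-squares fit of y = w*x; the analytic gradient dE/dw makes the
# search dramatically cheaper than gradient-free probing of w.
x = np.arange(10.0)
y = 3.0 * x

def error(w):
    return float(np.sum((w * x - y) ** 2))

def grad(w):
    return float(np.sum(2.0 * (w * x - y) * x))  # symbolic dE/dw

w = 0.0
lr = 1.0 / float(np.sum(2.0 * x * x))  # step sized to the curvature
for _ in range(50):
    w -= lr * grad(w)

print(round(w, 6), round(error(w), 6))  # 3.0 0.0
```

Packages like Theano derive such gradients symbolically for full networks, which is exactly where the big-memory pressure described above comes from: the generated gradient code and its intermediates grow rapidly with parameter count.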
Machine	1
App	A
App	B
App	C
App	D CPU Load-balancing	
Splitter
App	A
App	B
App	C
App	D CPU
Machine	2
App	A
App	B
App	C
App	D CPU
Machine	3
Fast	and	scalable	heterogeneous	workflows
Full	source	code		in	my	DDJ	tutorial	
http://www.drdobbs.com/parallel/232601605
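A greatly simplified sketch of the load-balancing splitter: each incoming task goes to the currently least-loaded machine. This greedy stand-in only illustrates the idea; the DDJ tutorial's actual implementation differs:

```python
import heapq

def load_balance(tasks, n_machines):
    """Greedy splitter: send each (name, cost) task to the machine
    with the smallest accumulated load."""
    heap = [(0.0, m, []) for m in range(n_machines)]  # (load, id, tasks)
    heapq.heapify(heap)
    for name, cost in tasks:
        load, m, assigned = heapq.heappop(heap)  # least-loaded machine
        assigned.append(name)
        heapq.heappush(heap, (load + cost, m, assigned))
    return {m: assigned for _, m, assigned in heap}

tasks = [("A", 4.0), ("B", 1.0), ("C", 1.0), ("D", 1.0)]
print(load_balance(tasks, 2))
```

One expensive task ("A") ends up alone on one machine while the three cheap tasks share the other, keeping the heterogeneous CPU/GPU/FPGA backends in the figure evenly busy.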
Volta
FPGA
Custom	Asic
So	much	more,	you	have	been	great,	Thank	You!
Rob	Farber	
CEO	TechEnablement.com	
Contact	info	at	techenablement.com for	consulting,	teaching,	writing,	and	other	inquiries