Understanding	why	AI	will	become	the	most	prevalent	server	
workload	by	2020
Rob	Farber	
CEO	TechEnablement.com	
Contact	info	at	techenablement.com for	consulting,	teaching,	writing,	and	other	inquiries
Machine	learning	has	redefined	the	market
“In	the	near	future	every	piece	of	data	in	the	data	center	will	be	
interacted	with	by	AI”	– Ian	Buck	(VP	Accelerated	Computing,	NVIDIA)
“By	2020	servers	will	run	data	analytics	more	than	any	other	
workload”	– Diane	Bryant	(VP	and	GM	of	the	Data	Center	Group,	
Intel)
Why?	“Computational	Universality”	via	training!
• The	famous	XOR	problem	nicely	emphasizes	the	
importance	of	hidden	neurons	
• Networks	with	hidden	units	can	implement	all	
Boolean	functions	used	to	build	a	computer
Computational	Universal	Machine	
Learning!
• Networks without nonlinear hidden units cannot learn XOR and hence are not computationally universal
• Cannot	represent	large	classes	of	problems
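To make the XOR point concrete: a network with two hidden units and threshold activations computes XOR exactly. This is a minimal hand-wired sketch (weights chosen by hand rather than trained) showing why hidden units matter:

```python
import numpy as np

def step(x):
    # Heaviside threshold activation
    return (x > 0).astype(int)

# Hidden layer: unit 1 computes OR, unit 2 computes NAND
W_hidden = np.array([[1.0, 1.0],     # OR weights
                     [-1.0, -1.0]])  # NAND weights
b_hidden = np.array([-0.5, 1.5])

# Output unit computes AND of the two hidden units -> XOR
w_out = np.array([1.0, 1.0])
b_out = -1.5

def xor_net(x):
    h = step(W_hidden @ x + b_hidden)
    return step(w_out @ h + b_out)

for x in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    # (0,0)->0, (0,1)->1, (1,0)->1, (1,1)->0
    print(x, int(xor_net(np.array(x, dtype=float))))
```

Removing the hidden layer leaves a single linear threshold unit, which cannot separate XOR's classes — the heart of the universality argument.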
NetTalk
Sejnowski,	T.	J.	and	Rosenberg,	C.	R.	(1986)	NETtalk:	a	parallel	
network	that	learns	to	read	aloud,	Cognitive	Science,	14,	179-211	
http://en.wikipedia.org/wiki/NETtalk_(artificial_neural_network)
500	learning	loops Finished
"Applications	of	Neural	Net	
and	Other	Machine	Learning	
Algorithms	to	DNA	Sequence	
Analysis",	(1989).
“How Neural Networks Work", (Lapedes, Farber, 1987).
Deep-Learning	(learn	from	data	many	of	the	
things	we	do)
Speech	recognition	in	noisy	
environments	(Siri,	Cortana,	
Google,	Baidu,	…)
Better	than	human	
accuracy	face	
recognition
Self-driving	cars
• Internet Search • Robotics • Self-guiding drones • Much, much more
Speech	recognition	is	a	Bellwether
A	driving	force	
for	ubiquitous	
inferencing	in	
the	data	center
Expect	amazing	growth	“$10T	incremental	value”	
with	1000x	increase	in	data	volume
• CEO	Saudi	Telecom	statement	during	his	KAUST	Global	IT	Keynote
• “We	expect	5G	to	increase	the	volume	of	mobile	data	by	1,000x”
• $10T	incremental	value
Khalid	Bin	Hussein	Bayari
CEO,	Saudi	Telecom
See	also:	http://www.mwc.gr/presentations/2016/kolokotronis.pdf and	https://www.itu.int/en/ITU-T/Workshops-
and-Seminars/standardization/201603/Documents/Abstracts-Presentations/S2P3_Ali_Amer.pptx
Source:	METIS
From	NetTalk to	Bioinformatics
Internal	
connections
The	phoneme	to	
be	pronounced
NetTalk
Sejnowski,	T.	J.	and	Rosenberg,	C.	R.	(1986)	
NETtalk:	a	parallel	network	that	learns	to	read	
aloud,	Cognitive	Science,	14,	179-211	
http://en.wikipedia.org/wiki/NETtalk_(artificial_neural
_network)
Internal	
connections
Input: sliding window of DNA bases (… A T C G T …)
"Applications	of	Neural	Net	and	Other	Machine	
Learning	Algorithms	to	DNA	Sequence	Analysis",	
A.S.	Lapedes,	C.	Barnes,	C.	Burks,	R.M.	Farber,	K.	
Sirotkin,	Computers	and	DNA,	SFI	Studies	in	the	
Sciences	of	Complexity,	vol.	VII,	Eds.	G.	Bell	and	
T.	Marr,	Addison-Wesley,	(1989).
T|F	Exon	region
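The sliding-window setup above can be sketched in a few lines: each DNA base in the window is one-hot encoded into the network's input vector, and the network outputs T|F for "exon region". This is an illustrative encoding; the 1989 paper's exact scheme may differ:

```python
import numpy as np

BASES = "ACGT"

def encode_window(seq):
    """One-hot encode a DNA window, one 4-element group per base
    (an illustrative encoding, not necessarily the paper's)."""
    vec = np.zeros(4 * len(seq))
    for i, base in enumerate(seq):
        vec[4 * i + BASES.index(base)] = 1.0
    return vec

x = encode_window("ATCGT")
print(x.shape)   # (20,)
print(x[:4])     # 'A' -> [1. 0. 0. 0.]
```

The encoded vector is what feeds the "internal connections" in the diagram; the same windowing idea carries directly over from NetTalk's letter windows.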
From	Bioinformatics	to	drug	design
(The	closer	you	look	the	greater	the	complexity)
Electron	Microscope
We	formed	a	company,	then	“The	Question”
How	do	we	know	you	
are	not	playing	
expensive	computer	
games	with	our	money?
Train	then	utilize	a	blind	test
Internal	
connections
A0
Binding	
affinity	for	a	
specific	
antibody
A1 A2 A3 A4 A5
Possible	hexamers	
20⁶ = 64M
1k – 2k pseudo-random
(hexamer, binding affinity) pairs
Approx.	0.001%	
sampling
“Learning Affinity Landscapes: Prediction of Novel Peptides”, Alan
Lapedes and Robert Farber, Los Alamos National Laboratory
Technical Report LA-UR-94-4391 (1994).
Hill	climbing	to	find	high	affinity
Internal	
connections
A0
Affinity(hexamer)
A1 A2 A3 A4 A5
Learn:
Affinity(hexamer) = f(A0, …, A5)
𝑓(F,F,F,F,F,F)
𝑓(F,F,F,F,F,L)
𝑓(F,F,F,F,F,V) 𝑓(F,F,F,F,L,L)
𝑓(P,C,T,N,S,L)
Predict	P,C,T,N,S,L	has	the	
highest	binding	affinity
Confirm	
experimentally
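The hill climb above can be sketched as greedy single-residue substitution over the learned affinity function. Here `predicted_affinity` is a toy stand-in for the trained network (the real f was learned from the ~1k–2k measured pairs), and the target hexamer is only for illustration:

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues

# Toy stand-in for the trained network's predicted affinity.
TARGET = "PCTNSL"
def predicted_affinity(hexamer):
    return sum(a == b for a, b in zip(hexamer, TARGET))

def hill_climb(f, start, max_scans=100):
    """Greedy single-residue substitutions until no neighbor improves."""
    current = list(start)
    for _ in range(max_scans):
        best, best_score = None, f("".join(current))
        for pos in range(6):
            for aa in AMINO_ACIDS:
                cand = current[:]
                cand[pos] = aa
                s = f("".join(cand))
                if s > best_score:
                    best, best_score = cand, s
        if best is None:        # local optimum reached
            break
        current = best
    return "".join(current)

print(hill_climb(predicted_affinity, "FFFFFF"))  # converges to PCTNSL
```

The key economics: each "experiment" here is a cheap network evaluation, so the 64M-point landscape can be searched without synthesizing 64M peptides.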
Two	important	points
• The	computer	appears	to	correctly	predict	experimental	data
• Demonstrated	that	complex	binding	affinity	relationships	can	be	
learned	from	a	small	set	of	samples
• Necessary	because	it	is	only	possible	to	sample	a	very	small	subset	of	the	
binding	affinity	landscape	for	drug	candidates
Time	series
Iterate
Xt+1 = f(Xt, Xt-1, Xt-2, …)
Xt+2 = f(Xt+1, Xt, Xt-1, …)
Xt+3 = f(Xt+2, Xt+1, Xt, …)
Xt+4 = f(Xt+3, Xt+2, Xt+1, …)
Internal	
connections
Xt Xt-1
Learn:
Xt+1 = f(Xt, Xt-1, Xt-2, …)
Xt-2 Xt-3 Xt-4 Xt-5
Xt+1
Works	great!	(better	than	other	
methods	at	that	time)
"How	Neural	Nets	Work",	A.S.	Lapedes,	R.M.	Farber,	
reprinted	in	Evolution,	Learning,	Cognition,	and	Advanced	
Architectures,	World	Scientific	Publishing.	Co.,	(1987).
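The closed-loop iteration can be sketched as follows. Here `f` is the true logistic map standing in for the learned network, so only the feedback scheme itself is illustrated — each prediction is pushed back onto the input window:

```python
from collections import deque

# Stand-in for the learned network f(Xt, Xt-1, ...): the logistic map.
def f(history):
    x_t = history[0]          # newest value first
    return 3.5 * x_t * (1.0 - x_t)

def iterate_forecast(history, steps):
    """Closed-loop forecasting: each prediction is fed back as input."""
    h = deque(history, maxlen=len(history))
    preds = []
    for _ in range(steps):
        x_next = f(h)
        preds.append(x_next)
        h.appendleft(x_next)  # newest value first, matching f's convention
    return preds

print(iterate_forecast([0.5, 0.49, 0.48], steps=4))
```

This is exactly the "Iterate" scheme on the slide: one-step-ahead predictions chained into multi-step forecasts.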
‘Sliding	inference’	during	training	to	increase	
accuracy
Xt Xt-1
Internal
connections
Xt-2 Xt-3 Xt-4 Xt-5 Xt+1 Xt+2 Xt+3
Pt+3
Error(example) = Σi=1..3 (Xt+i − Pt+i)²
Pt+1
Pt+2
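A minimal sketch of that multi-step error: predict t+1 through t+k closed-loop, comparing each prediction Pt+i against the observed Xt+i. The toy `f` below is a stand-in for the ANN:

```python
def sliding_inference_error(f, history, targets):
    """Multi-step training error: predict t+1..t+k closed-loop and
    sum squared errors against the observed X values."""
    h = list(history)           # newest first: [Xt, Xt-1, ...]
    error = 0.0
    for x_true in targets:      # [Xt+1, Xt+2, Xt+3]
        p = f(h)
        error += (x_true - p) ** 2
        h = [p] + h[:-1]        # feed the prediction back in
    return error

# toy f: predict the mean of the history window (stand-in for the ANN)
f = lambda h: sum(h) / len(h)
print(sliding_inference_error(f, [1.0, 1.0, 1.0], [1.0, 1.0, 1.0]))  # 0.0
```

Training against this summed error forces the network to stay accurate when its own predictions are recycled as inputs, which is what single-step training does not guarantee.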
Designing	ANNs	for	Integration	and	
Bifurcation	analysis	– “training	a	netlet”
"Identification of Continuous-Time Dynamical Systems:
Neural Network Based Algorithms and Parallel
Implementation", R. M. Farber, A. S. Lapedes, R.
Rico-Martinez and I. G. Kevrekidis, Proceedings of the 6th
SIAM Conference on Parallel Processing for Scientific
Computing, Norfolk, Virginia, March 1993.
ANN	schematic	for	continuous-time	
identification.	(a)	A	four-layered	ANN	based	on	
a	fourth	order	Runge-Kutta integrator.	(b)	ANN	
embedded	in	a	simple	implicit	integrator.
(a)	Periodic	attractor	of	the	Van	der	Pol	oscillator	for	g=	1.0,	
d=	4.0	and		w =	1.0.	The	unstable	steady	state	in	the	interior	
of	the	curve	is	marked	+.	(b)	ANN-based	predictions	for	the	
attractors	of	the	Van	der	Pol	oscillator	shown	in	(a).
Dimension	reduction
• The	curse	of	dimensionality
• People	cannot	visualize	data	beyond	3D	+	color
• Search	volume	rapidly	increases	with	dimension
• Queries	return	too	much	data	or	no	data
[Diagram: many sensor input channels (Sensor 1 … Sensor N) feed networks through a bottleneck unit (B); the bottleneck activations provide a reduced (X, Y, Z) representation suitable for visualization]
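A sketch of the bottleneck idea: N sensor channels are squeezed through a 3-unit code that can be plotted as (X, Y, Z). The weights here are random and untrained, shown only for the structure; training would fit the encoder/decoder to minimize reconstruction error:

```python
import numpy as np

rng = np.random.default_rng(0)
n_sensors, bottleneck = 20, 3   # compress 20 sensor channels to (X, Y, Z)

# Untrained linear autoencoder, shown only for its bottleneck structure.
W_enc = rng.normal(size=(bottleneck, n_sensors))
W_dec = rng.normal(size=(n_sensors, bottleneck))

readings = rng.normal(size=n_sensors)   # one vector of sensor readings
xyz = W_enc @ readings                  # 3-D code: the plottable X, Y, Z
reconstruction = W_dec @ xyz            # back to sensor space

print(xyz.shape, reconstruction.shape)  # (3,) (20,)
```

The 3-D code sidesteps both curse-of-dimensionality symptoms on the slide: humans can visualize it, and search volumes stay small.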
A	general	SIMD	mapping:	
Optimize(LMS_Error =	objFunc(p1,	p2,	…	pn))
Examples
0, N-1
Examples
N, 2N-1
Examples
2N, 3N-1
Examples
3N, 4N-1
Step 2
Calculate partials
Step1
Broadcast
parameters
Optimization Method
(Powell, Conjugate Gradient, Other)
Step 3
Sum partials to get
energy
GPU 1 GPU 2 GPU 3
p1,p2, … pn p1,p2, … pn p1,p2, … pn p1,p2, … pn
GPU 4
Host
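The broadcast/partial/sum pattern in the figure can be sketched with plain Python chunks standing in for the GPUs: the host broadcasts the parameters, each device computes a partial LMS error over its slice of the examples, and the partials are summed into the energy handed back to the optimizer:

```python
import numpy as np

def partial_lms(params, examples):
    """Per-device step 2: partial least-mean-squares error over one chunk."""
    w, b = params
    x, y = examples
    pred = w * x + b
    return float(np.sum((pred - y) ** 2))

# toy dataset y = 2x + 1, split across 4 "devices" (stand-ins for GPUs)
x = np.arange(8.0)
y = 2.0 * x + 1.0
chunks = [(x[i::4], y[i::4]) for i in range(4)]

def objective(params):
    # step 1: broadcast params; step 2: partials; step 3: sum to energy
    return sum(partial_lms(params, c) for c in chunks)

print(objective((2.0, 1.0)))       # exact parameters -> zero energy
print(objective((0.0, 0.0)) > 0)   # wrong parameters -> positive energy
```

Because each chunk's partial is independent, the map step scales to however many devices hold the examples — the SIMD mapping the slide describes.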
[Plot: TACC Stampede PCA scaling — Average Sustained TF/s (0–2500) vs. number of Intel Xeon Phi coprocessors/Sandy Bridge nodes (0–3500)]
Many	problems	are	too	big	for	a	single	computer	–
Strong	scaling	execution	model!
Perfect	strong	scaling	decreases	runtime	linearly	by	
the	number	of	processing	elements	
• O(log N) scaling is "good enough"
See	a	path	to	exascale
(MPI	can	map	to	thousands	of	GPU	or	Processor	nodes)
Always	report	“Honest	Flops”
Expect	significant	algorithm	and	HW	retooling
Today:	“only	7%	of	all	servers	being	used	for machine	learning	and	only	0.1%	are	
running	deep	neural	nets”	– Forbes
By	2020,	100%	of	servers	running	machine	learning	– Intel,	NVIDIA,	Wall	Street
Roughly four different camps: CPU, GPU, FPGA, and custom chips
NVIDIA	(GPUs)
• Restarted	massive	parallelism	with	CUDA	and	GPU	computing
• Making	big	inroads	into	the	data	center
GPU threads are grouped into thread blocks
• Threads communicate only within a thread block
• (yes, there are also global atomic ops)
• Fast	hardware	scheduling
• Blocks run when their dependencies are resolved
• Blocks	that	are	ready	to	run	get	assigned	to	processing	elements
• Fast	hardware	scheduling
Scalability	required	to	use	all	those	cores	
(strong	scaling	execution	model)
Active	Queue
Executables	can	run	unchanged on	bigger	
GPUs
• Dealer	Analogy
Scheduler SMX
Strong	Scaling	
Execution	Model
NVIDIA	Claims	Big	Perf.	Increases	since	2013
NVIDIA	on	speech	recognition
NVIDIA	Claims	for	P40	using	INT8	math	&	TensorRT
Processor-based	computing
AMD
Intel
Traditional	Vector	ISA
[Illustration: traditional design — each core paired with a wide SSE unit — vs. each core paired with a 512-bit-wide vector unit]
Floating-point	performance	comes	from	the	dual	
per	core	vector	units
• AVX-512	=	16	32-bit	ops/clock
[Chart: Performance vs. Parallelism — scalar single-threaded code sits at "No Parallelism", vector single-threaded code above it, and vector plus massively parallel code reaches the highest performance. Image courtesy Elsevier]
Convergence	(for	training	and	HPC	in	general)
• NVIDIA	Pascal	has	a	working	MMU	(Memory	Management	Unit)
• Data can automatically be moved between CPU and GPU on a demand basis.
• Offload programming is no longer a requirement, it's an optimization!
• This is a really big deal, as code changes are a barrier to GPU adoption
• Pascal	GPUs	have	fast	stacked	memory	(Much	faster,	more	capacity,	energy	efficient)	
• and	NVlink (fast	host/GPU	memory	bandwidth	transfer	– but	only	with	IBM!)
• Intel	Xeon	Phi	(formerly	known	as	Knights	Landing):
• Data	can	be	automatically	moved	between	near	(stacked)	and	far	(DDR4)	memory	on	a	
demand	basis	using	cache	mode.
• Offload programming is no longer a requirement, it's an optimization!
• Stacked	memory	is	much	faster,	more	capacity,	energy	efficient
• IBM
• Data	can	be	automatically	moved	between	Power	and	GPU	on	a	demand	basis	using	
NVlink.
• Offload programming is no longer a requirement, it's an optimization!
IBM	approach
• Sumit Gupta	(VP,	High	Performance	Computing	and	Data	Analytics,	
IBM),	“fundamentally,	accelerators	are	the	path	forward.”	
• These	accelerators	are	GPUs	for	compute,	storage	accelerators	for	big	data	
and	FPGAs	for	special	functions.
• Watson	(of	Jeopardy	fame)	for	software
• TrueNorth
• Developed	as	part	of	the	DARPA	SyNAPSE program
• 46	Billion	synaptic	op/s	using	70	mW!
Sumit Gupta
The	OpenPower ‘special	sauce’
1. CAPI	(IBM	Coherent	Accelerator	Processor	Interface)
2. NVlink used in the CORAL "Summit" and "Sierra" supercomputers
• Make	application	acceleration:
• Much	easier	
• Transparent	for	the	application	programmer.	
• ‘Open’	for	all	to	join
CAPI	is	important
• Supports	compute,	storage,	and	
special	(e.g.	FPGA)	accelerators
• Shares	virtual	addressing	–
everything	works	with	same	
memory	addresses
• Provides	hardware	managed	
cache	coherence
• Claims	to	eliminate	97%	of	code	
path	length!
Data	handling	can	take	as	much	time	as	the	
computational	problem!
• ORNL	Titan
– 112,128	GB	of	GPU	memory	in	18,688	K20x	GPUs
• Data	handling	must	be
– Language	agnostic
– Scalable
OpenPower storage	accelerators
• 56	Terabyte	‘extended’	memory	on	Power8	using	flash
Databases	are	important	to	processing	data	for	training	sets	and	more
NoSQL	(Not	Only	SQL)	Databases
Server	throughput	in	Ops/Sec	(50/50	read/write	ratio)	Image	courtesy	IBM
SPEC	M3	Benchmark	on	68	TB	CAPI			system
• Fastest	mean	response	times	and	most	consistent	response	times	(lowest	
standard	deviation)	ever	reported,	for	all	combinations	of	query	type,	data	
volume,	and	concurrent	users.
• Each	mean	response	time	was	5.5x	to	212x	the	previous	best	result,	
including:
• 21x	to	212x	the	performance	of	the	previous	best	published	result	for	the	market	
snap	benchmarks	(10T.YR[n]-MKTSNAP.TIME)	*
• 21x	the	performance	of	the	previous	best	published	result	for	year	high	bid	in	the	
smallest	year	of	the	dataset	(1T.OLDYRHIBID.TIME)	*
• 8-10x	the	performance	of	the	previous	best	published	result	for	the	100-user	
volume-weighted	average	bid	benchmarks	(100T.YR[n]VWAB-12D-HO.TIME)	**
• 5-8x	the	performance	of	the	previous	best	published	result	for	the	N-year	high-bid	
benchmarks	(1T.[n]YRHIBID.TIME)
Fast	scalable	data	loads	for	training	via	
parallel	file	systems
Node	
1
Node	
2
Node	
3
Node	
4
Node	
5
Node	
6
Node	
7
Node	
500
Each	MPI	client	on	
each	node:
1. Opens	file
2. Seeks	to	location
3. Reads	data
4. Close
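Each client's open/seek/read/close sequence is ordinary file I/O against the parallel file system; a minimal per-rank sketch (the file name and record size below are made up for the demo):

```python
import os
import tempfile

def read_my_slice(path, rank, nbytes):
    """One MPI client's data load: open, seek to its offset, read, close."""
    with open(path, "rb") as f:        # 1. open
        f.seek(rank * nbytes)          # 2. seek to this rank's slice
        return f.read(nbytes)          # 3. read (4. close via 'with')

# demo: a file holding 4 ranks' worth of 8-byte records
path = os.path.join(tempfile.mkdtemp(), "training.bin")
with open(path, "wb") as f:
    f.write(b"".join(bytes([r]) * 8 for r in range(4)))

print(read_my_slice(path, rank=2, nbytes=8))  # rank 2's 8-byte record
```

Because every rank seeks to a disjoint offset, the 500 nodes in the figure can load their training slices concurrently with no coordination beyond knowing their own rank.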
Other	training	and	inference	solutions
FPGAs
• Fast,	power	efficient	and	perfect	for	variable	precision	arithmetic
• Accessible	via	CAPI
• Very	difficult	to	program
Custom	chips
• Fast,	power	efficient	and	perfect	for	variable	precision	arithmetic
Heatsink	City
• A is	quad	Google	TPU2	motherboard	side	
view
• B is	dual	IBM	Power9	“Zaius”	motherboard
• C is	dual	IBM	Power8	“Minsky”	
motherboard
• D is	Dual	Intel	Xeon	Facebook	“Yosemite”	
motherboard
• E is Nvidia P100 SXM2 module with heat sink and Facebook "Big Basin" motherboard
(Image	courtesy	The	Next	Platform)
Nervana and	many	other	offerings
Google	TPU2:	
• 15x-30x	faster	than	CPU	and	GPU,	“On	
our	production	AI	workloads	that	utilize	
neural	network	inference”
• 30x	to	80x	improvement	in	TOPS/Watt	
measure
• Exclusive	to	Google	Cloud
IBM	TrueNorth
(A	path	to	the	future???)
• PNAS (8/9/16): "Convolutional networks for fast, energy-efficient neuromorphic computing", Steven K. Esser et al.
• Chip	implements	networks	using	integrate-and-fire	spiking	neurons
• IBM	researchers	ran	the	datasets	at	between	1,200	and	2,600	frames/s	and	
using	between	25	and	275	mW (effectively	>6,000	frames/s	per	watt)
• Can	go	really	big
A	system	roughly	the	neuronal	size	of	a	rat	brain
TrueNorth:	Accuracy
• PNAS	paper:	“[We]	demonstrate	that	neuromorphic	
computing	…	can	implement	deep	convolution	networks	
that	approach	state-of-the-art	classification	accuracy
across	eight	standard	datasets	encompassing	vision	and	
speech,	perform	inference	while	preserving	the	hardware’s	
underlying	energy-efficiency	and	high	throughput.”
Accuracy of different sized networks running on
one or more TrueNorth chips to perform
inference on eight datasets. For comparison,
accuracy of state-of-the-art unconstrained
approaches are shown as bold horizontal lines
(hardware resources used for these networks are
not indicated).
Common	software
• Theano:	A	Python	library	that	generates	C-code	for	a	CPU	or	GPU
• TensorFlow:	Google’s	open	source	library	for	machine	learning
• cuDNN:	NVIDIA’s	machine	learning	library
• Intel	DAAL	(Data	Analytics	Library)
• Torch:	An	open	source	middleware	library
• Caffe:	Berkeley’s	popular	framework
Speaking	of	accuracy	and	software	…
• Symbolically	calculated	gradients	(Jacobians)	are	important	as	they	greatly	
assist	the	search	for	good	solutions.	Think	L-BFGS,	Conjugate	Gradient,	…
• Use	of	a	gradient	provides	an	algorithmic	speedup	that	can	achieve	orders	of	
magnitude	faster	time-to-model	as	well	as	better	solutions.
• Getting	the	gradient	is	pretty	easy	with	popular	software	packages	such	as	Theano.
• Big	memory	is	required	to	perform	the	gradient	calculation
• The	size	of	the	gradient	gets	very	large,	very	fast	as	the	number	of	parameters	in	the	
ANN	model	increases.	
• Memory	capacity	and	bandwidth	limitations	(plus	cache	and	potentially	atomic	
instruction	performance)	dominate	the	runtime	of	the	gradient	calculation.
• The	size	of	the	code	can	exceed	GPU	instruction	memory	capacity.
• Definitely	a	place	for	big	memory	many-core	processors!
• Like	Power	and	Nvlink as	data	is	shared	between	all	devices	efficiently
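Why the gradient matters so much: with the analytic dE/dw of a least-squares objective, the search can converge in a handful of steps instead of blind probing. A minimal one-parameter sketch (real ANN Jacobians are the many-parameter version of the same idea):

```python
import numpy as np

# Least-squares fit of y = w*x; the analytic gradient dE/dw makes the
# search dramatically cheaper than gradient-free probing of w.
x = np.arange(10.0)
y = 3.0 * x

def error(w):
    return float(np.sum((w * x - y) ** 2))

def grad(w):
    return float(np.sum(2.0 * (w * x - y) * x))  # symbolic dE/dw

w = 0.0
lr = 1.0 / float(np.sum(2.0 * x * x))  # step sized to the curvature
for _ in range(50):
    w -= lr * grad(w)

print(round(w, 6), round(error(w), 6))  # 3.0 0.0
```

Packages like Theano derive such gradients symbolically for full networks, which is exactly where the big-memory pressure described above comes from: the generated gradient code and its intermediates grow rapidly with parameter count.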
Machine	1
App	A
App	B
App	C
App	D CPU Load-balancing	
Splitter
App	A
App	B
App	C
App	D CPU
Machine	2
App	A
App	B
App	C
App	D CPU
Machine	3
Fast	and	scalable	heterogeneous	workflows
Full	source	code		in	my	DDJ	tutorial	
http://www.drdobbs.com/parallel/232601605
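A greatly simplified sketch of the load-balancing splitter: each incoming task goes to the currently least-loaded machine. This greedy stand-in only illustrates the idea; the DDJ tutorial's actual implementation differs:

```python
import heapq

def load_balance(tasks, n_machines):
    """Greedy splitter: send each (name, cost) task to the machine
    with the smallest accumulated load."""
    heap = [(0.0, m, []) for m in range(n_machines)]  # (load, id, tasks)
    heapq.heapify(heap)
    for name, cost in tasks:
        load, m, assigned = heapq.heappop(heap)  # least-loaded machine
        assigned.append(name)
        heapq.heappush(heap, (load + cost, m, assigned))
    return {m: assigned for _, m, assigned in heap}

tasks = [("A", 4.0), ("B", 1.0), ("C", 1.0), ("D", 1.0)]
print(load_balance(tasks, 2))
```

One expensive task ("A") ends up alone on one machine while the three cheap tasks share the other, keeping the heterogeneous CPU/GPU/FPGA backends in the figure evenly busy.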
Volta
FPGA
Custom	Asic
So	much	more,	you	have	been	great,	Thank	You!
Rob	Farber	
CEO	TechEnablement.com	
Contact	info	at	techenablement.com for	consulting,	teaching,	writing,	and	other	inquiries