© 2015 ligaDATA, Inc. All Rights Reserved.
Using Deep Learning to
do Real-Time Scoring in
Practical Applications
Deep Learning Applications Meetup, Monday, 12/14/2015, Mountain View, CA
By Greg Makowski
Community @ 
Try out
Deep Learning - Outline
•  Big Picture of 2016 Technology
•  Neural Net Basics
•  Deep Network Configurations for Practical Applications
   –  Auto-Encoder (i.e. data compression or Principal Components Analysis)
   –  Convolutional (shift invariance in time or space for voice, image or IoT)
   –  Real Time Scoring and Lambda Architecture
   –  Deep Net libraries and tools (R, H2O, DL4J, TensorFlow, Gorila, Kamanja)
   –  Reinforcement Learning, Q-Learning (i.e. beat people at Atari games, IoT)
   –  Continuous Space Word Models (i.e. word2vec)
Gartner’s Top 2016 Strategic Technology Trends
Advantages of a Net over Regression
A Regression: Fit one Line
[Figure: scatter plot with axes "field 1" and "field 2"; each data point is marked with its target value, $ or c]
Target values for a data point, with source field values graphed by “field 1” and “field 2”
Showing ONE target field, with values of $ or c
Advantages of a Net over Regression
A Neural Net
[Figure: the same scatter plot of "field 1" vs. "field 2"; recoverable labels: "which are not adjacent", "nodes can be line or circle"]
A Comparison of a Neural Net and Regression
A Logistic regression formula:
     Y = f(a0 + a1*X1 + a2*X2 + a3*X3)
     a* are coefficients
Backpropagation, cast in a similar form:
    H1 = f(w0 + w1*I1 + w2*I2 + w3*I3)
    H2 = f(w4 + w5*I1 + w6*I2 + w7*I3)
    Hn = f(w8 + w9*I1 + w10*I2 + w11*I3)
    O1 = f(w12 + w13*H1 + .... + w15*Hn)
    On = ....
    w* are weights, AKA coefficients
    I1..In are input nodes or input variables.
    H1..Hn are hidden nodes, which extract features of the data.
    O1..On are the outputs, which group disjoint categories.
Look at ratio of training records vs. free parameters (complexity, regularization)
[Figure: regression diagram with coefficients a1, a2, a3 on inputs X1, X2, X3; network diagram with Input 1, I2, I3 feeding H1, Hidden 2]
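The two formula families above can be compared in a few lines of NumPy. This is a minimal sketch rather than the speaker's code: the function names, the example weights and the choice of a sigmoid for f are assumptions made for illustration. It uses 3 inputs, 3 hidden nodes and 1 output, so the net has the 16 free parameters w0..w15 written on the slide.

```python
import numpy as np

def f(z):
    """Sigmoid squashing function, assumed here for f(...)."""
    return 1.0 / (1.0 + np.exp(-z))

def logistic_regression(x, a0, a):
    """Y = f(a0 + a1*X1 + a2*X2 + a3*X3)."""
    return f(a0 + np.dot(a, x))

def net_forward(x, b_hidden, W_hidden, b_out, W_out):
    """One hidden layer: every hidden node is its own small regression."""
    h = f(b_hidden + W_hidden @ x)   # H1..Hn
    o = f(b_out + W_out @ h)         # O1..On
    return h, o

x = np.array([0.5, -1.0, 2.0])       # I1, I2, I3 (made-up values)
y = logistic_regression(x, a0=0.1, a=np.array([0.4, -0.3, 0.2]))

rng = np.random.default_rng(0)       # arbitrary illustrative weights
W_hidden, b_hidden = rng.normal(size=(3, 3)), rng.normal(size=3)
W_out, b_out = rng.normal(size=(1, 3)), rng.normal(size=1)
hidden, output = net_forward(x, b_hidden, W_hidden, b_out, W_out)

# Free parameters to weigh against the number of training records.
n_regression_params = 1 + 3
n_net_params = W_hidden.size + b_hidden.size + W_out.size + b_out.size
print(y, output, n_regression_params, n_net_params)
```

Even this tiny net has four times the free parameters of the regression, which is why the ratio of training records to parameters matters.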
Think of Separating Land vs. Water
Different algorithms use different Basis Functions:
•  One line
•  Many horizontal & vertical lines
•  Many diagonal lines
•  Circles
[Figure panels: 1 line (more errors); Decision Tree, 12 splits (more elements, less computation); 5 Hidden Nodes in a Neural Network]
Q) What is too detailed? “Memorizing high tide boundary” and applying it at all times
Leading up to an Auto Encoder
•  Supervised Learning
   –  Regression, Tree or Net: 50 inputs → 1 output
   –  Possible nets:
      •  256 → 120 → 1
      •  256 → 120 → 5 (trees, regressions and most are limited to 1 output)
      •  256 → 120 → 60 → 1
      •  256 → 180 → 120 → 60 → 1 (start getting into training stability problems, with old
•  Unsupervised Learning
   –  Clustering (traditional unsupervised):
      •  60 inputs (no target); produce 1-2 new (cluster ID & distance)
Auto Encoder (like data compression)
Relate input to output, through compressed middle
•  Supervised Learning
   –  Regression, Tree or Net: 50 inputs → 1 output
   –  Possible nets:
      •  256 → 120 → 1
      •  256 → 120 → 5 (trees, regressions, SVD and most are limited to 1 output)
      •  256 → 120 → 60 → 1
      •  256 → 180 → 120 → 60 → 1 (start getting long training times to stabilize, or may not finish,
•  Unsupervised Learning
   –  Clustering (traditional unsupervised):
      •  60 inputs (no target); produce 1-2 new (cluster ID & distance)
   –  Unsupervised training of a net, assign (target record == input record) AUTO-ENCODING
   –  Train net in stages,
      •  256 → 180 → 256
         → 120 →
         → 120 →
         → 120 →
      •  Add supervised layer to forecast 10 target categories
         → 10
Because of symmetry, only need to update mirrored weights once
4 hidden layers w/ unsupervised training
1 layer at end w/ supervised training
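The staged training described above can be sketched as follows. This is an illustrative NumPy toy, not the training code behind the slides: the function names, the sigmoid units, the learning rate and the tiny 16 → 12 → 8 layer sizes (instead of 256 → 180 → 120) are all assumptions chosen so the example runs quickly. Each stage uses the input record as its own target and shares one weight matrix between encoder and decoder, so the mirrored weights receive a single update.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_stage(X, n_hidden, epochs=200, lr=0.5, seed=0):
    """One auto-encoding stage: n_in -> n_hidden -> n_in, target == input."""
    rng = np.random.default_rng(seed)
    W = rng.normal(0.0, 0.1, size=(X.shape[1], n_hidden))  # encoder AND decoder
    b_hid, b_out = np.zeros(n_hidden), np.zeros(X.shape[1])
    history = []
    for _ in range(epochs):
        H = sigmoid(X @ W + b_hid)        # compressed middle
        R = sigmoid(H @ W.T + b_out)      # reconstruction via mirrored weights
        err = R - X
        history.append(float(np.mean(err ** 2)))
        d_out = err * R * (1.0 - R)
        d_hid = (d_out @ W) * H * (1.0 - H)
        # The mirrored weights get one update that sums both uses of W.
        grad_W = X.T @ d_hid + d_out.T @ H
        W -= lr * grad_W / len(X)
        b_hid -= lr * d_hid.mean(axis=0)
        b_out -= lr * d_out.mean(axis=0)
    return W, b_hid, history

def train_stack(X, hidden_sizes):
    """Train stage by stage; each stage compresses the previous stage's codes."""
    encoders, histories, codes = [], [], X
    for n_hidden in hidden_sizes:
        W, b_hid, history = train_stage(codes, n_hidden)
        encoders.append((W, b_hid))
        histories.append(history)
        codes = sigmoid(codes @ W + b_hid)
    return encoders, histories, codes

# Toy data (made up): 200 noisy copies of 4 binary prototype records.
rng = np.random.default_rng(1)
prototypes = rng.integers(0, 2, size=(4, 16)).astype(float)
X = np.clip(prototypes[rng.integers(0, 4, size=200)]
            + rng.normal(0.0, 0.05, size=(200, 16)), 0.0, 1.0)

encoders, histories, codes = train_stack(X, hidden_sizes=[12, 8])
print([round(h[0], 4) for h in histories], [round(h[-1], 4) for h in histories])
```

A supervised output layer (for example 10 target categories) would then be trained on `codes`, which is the "1 layer at end w/ supervised training" step.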
Auto Encoder
How it can be generally used to solve problems
•  Add supervised layer to forecast 10 target categories
   –  4 hidden layers trained with unsupervised training,
   –  1 new layer, trained with supervised learning
      → 10
•  Outlier detection
   –  The “activation” at each of the 120 output nodes indicates the “match” to that cluster or compressed feature
   –  When scoring new records, can detect outliers with a process like
      If (max_output_match < 0.333) then suspected outlier
•  How is it like PCA?
   –  Individual hidden nodes in the same layer are “different” or “orthogonal”
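The outlier rule above translates almost directly into code. A minimal sketch, assuming the activations arrive as a NumPy array with one value per compressed feature; the function name and the example activation values are made up, while the 0.333 threshold is the one quoted on the slide.

```python
import numpy as np

def is_suspected_outlier(activations, threshold=0.333):
    """True when no cluster or compressed feature matches the record well."""
    max_output_match = float(np.max(activations))
    return max_output_match < threshold

typical = np.array([0.05, 0.91, 0.12, 0.40])   # one feature matches strongly
unusual = np.array([0.10, 0.22, 0.08, 0.31])   # nothing matches well

print(is_suspected_outlier(typical), is_suspected_outlier(unusual))
# prints: False True
```

In practice the threshold would be tuned on held-out records rather than fixed in advance.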
How Transferable are Features in Deep Neural Networks?
Deep Learning Caused a 50% Reduction in Speech Recognition Error Rates in 4 yrs
“The use of deep neural nets in production speech systems really started more like in 2011... I would estimate that from the time before deep neural nets were used until now, the error rate on production speech systems fell from about 20% down to below 10%, so more than a 50% reduction in error rate.” - Jeff Dean (Senior Fellow in the Knowledge Group), email to Greg
[Figure: Drop in Speech Rec. Error Rates; annotation: Deep Learning]
Internet of Things (IoT) is heavily signal data
Convolutional Neural Net (CNN)
Enables detecting shift invariant patterns
In Speech and Image applications, patterns vary by size, can be shifted right or left
Challenge: finding a bounding box for a pattern is almost as hard as detecting the pattern.
Neural Nets can be explicitly trained to provide a FFT (Fast Fourier Transform) to convert data from the time domain to the frequency domain, but typically an explicit FFT is used
Solution: use a sliding convolution to detect the pattern
CNN can use very long observational windows, up to 400 ms, long context
Convolution Neural Net: from LeNet-5
Gradient-Based Learning Applied to Document Recognition
Proceedings of the IEEE, Nov 1998
Yann LeCun, Leon Bottou, Yoshua Bengio and Patrick Haffner
Facebook, AI Research
Convolution Neural Net (CNN)
•  How is a CNN trained differently than a typical back propagation (BP) network?
   –  Parts of the training which are the same:
      •  Present input record
      •  Forward pass through the network
      •  Back propagate error (i.e. per epoch)
   –  Different parts of training:
      •  Some connections are CONSTRAINED to the same value
         –  The connections for the same pattern, sliding over all input space
      •  Error updates are averaged and applied equally to the one set of weights
      •  End up with the same pattern detector feeding many nodes at the next level
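The weight-sharing idea above can be shown with a one-dimensional toy. This is a hedged NumPy sketch, not a full CNN: the function names and the example signals are assumptions. The same three weights are applied at every position, so the detector responds equally to a pattern wherever it is shifted, and the per-position error updates are averaged into one update for the shared weights.

```python
import numpy as np

def slide_detector(signal, kernel):
    """Apply ONE shared set of weights at every position of the input."""
    width = len(kernel)
    n_positions = len(signal) - width + 1
    return np.array([np.dot(signal[i:i + width], kernel)
                     for i in range(n_positions)])

def shared_weight_gradient(signal, deltas, width):
    """Average the per-position error updates into one update for the kernel."""
    windows = np.array([signal[i:i + width] for i in range(len(deltas))])
    return (deltas[:, None] * windows).mean(axis=0)

pattern = np.array([1.0, 2.0, 1.0])

signal_a = np.zeros(12)
signal_a[2:5] = pattern          # pattern near the start
signal_b = np.zeros(12)
signal_b[7:10] = pattern         # same pattern, shifted right

response_a = slide_detector(signal_a, pattern)
response_b = slide_detector(signal_b, pattern)
print(response_a.argmax(), response_b.argmax(), response_a.max(), response_b.max())

# Made-up error signals, one per output position, just to show the shapes.
deltas = response_a - 6.0 * (np.arange(len(response_a)) == 2)
grad = shared_weight_gradient(signal_a, deltas, width=len(pattern))
```

The peak response is the same in both signals; only its position moves, which is the shift invariance the slide refers to.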
Convolutional Deep Belief Networks for Scalable
Unsupervised Learning of Hierarchical Representations, 2009
Convolution Neural Net (CNN)
Same Low Level Features
The Mammalian Visual Cortex is Hierarchical
(The Brain is a Deep Neural Net - Yann LeCun)
Convolution Neural Net (CNN)
Facebook example
Convolution Neural Net (CNN)
Yahoo + Stanford example – find a face in a pic, even upside down
Convolutional Neural Nets (CNN)
Robotic Grasp Detection (IoT)
Real Time Scoring
•  Auto-Encoding nets
   –  Can grow to millions of connections, and start to get computational
   –  Can reduce connections by 5% to 25+% with pruning & retraining
      •  Train with increased regularization settings
      •  Drop connections with near zero weights, then retrain
      •  Drop nodes with fan in connections which don’t get used much later, such as in your predictive problem
      •  Perform sensitivity analysis – delete possible input fields
•  Convolutional Neural Nets
   –  With large enough data, can even skip the FFT preprocessing step
   –  Can use wider than 10ms audio sampling rates for speed up
•  Implement other preprocessing as lookup tables (i.e. Bayesian Priors)
•  Use cloud computing, do not limit to device computing
•  Large models don’t fit → use model or data parallelism to train
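The "drop connections with near zero weights, then retrain" step above can be sketched with a simple magnitude mask. This is a minimal illustration under stated assumptions: the function name, the 0.05 threshold and the example weight matrix are all made up, and real pruning schedules are usually more gradual.

```python
import numpy as np

def prune_small_weights(W, threshold=0.05):
    """Zero out connections whose weight magnitude is below the threshold."""
    mask = np.abs(W) >= threshold
    return W * mask, mask

W = np.array([[0.80, -0.01,  0.30],
              [0.02, -0.60,  0.00],
              [0.04,  0.50, -0.90]])

pruned, mask = prune_small_weights(W)
fraction_dropped = 1.0 - mask.mean()
print(pruned)
print(f"dropped {fraction_dropped:.0%} of connections")

# During retraining, multiply each weight update by the mask so that
# dropped connections stay at zero, e.g.:  W -= lr * grad * mask
```

Fewer non-zero connections means fewer multiply-adds per record at scoring time, which is where the real-time benefit comes from.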
Real Time Scoring
Lambda Architecture – for both Batch and Real Time
•  First architecture to really define how batch and stream processing can work together
•  Founded on the concepts of immutability and re-computation, with human fault tolerance
•  Pre-computes the results of batch & real-time processes as a set of views, & query layer merges the views
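The "query layer merges the views" idea can be illustrated with a toy event counter. This is a sketch of the general Lambda pattern only, not of Kamanja or any other product: the event fields, the cutoff and the function names are hypothetical. The master dataset is treated as immutable; the batch view is recomputed from it up to a cutoff, and the real-time view covers only the events after the cutoff.

```python
from collections import Counter

def build_view(events):
    """Pre-compute a view: number of events per user."""
    return Counter(event["user"] for event in events)

def query(batch_view, realtime_view, user):
    """Query layer: merge the batch and real-time views at read time."""
    return batch_view[user] + realtime_view[user]

# Immutable, append-only master dataset (made-up events).
master_dataset = [
    {"user": "ann", "t": 1},
    {"user": "bob", "t": 2},
    {"user": "ann", "t": 3},
    {"user": "ann", "t": 4},
    {"user": "bob", "t": 5},
]

BATCH_CUTOFF = 3   # the last batch run covered events up to t == 3
batch_view = build_view(e for e in master_dataset if e["t"] <= BATCH_CUTOFF)
realtime_view = build_view(e for e in master_dataset if e["t"] > BATCH_CUTOFF)

print(query(batch_view, realtime_view, "ann"),
      query(batch_view, realtime_view, "bob"))
```

If a bug corrupts a view, it can be thrown away and recomputed from the master dataset, which is the "human fault tolerance" point.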
Real Time Scoring
Lambda Architecture With Kamanja
[Figure: Lambda architecture with Kamanja; recoverable labels: "Real time", "Views &"]
•  Kamanja embraces and extends Lambda architecture
•  Transform and process messages in real-time, combine messages with historical data and compute real-time views to make real-time decisions based on the views
Real Time Computing
Kamanja Technology Stack
[Figure: stack diagram; recoverable labels: (PMML, Java or Scala Consumer); High level languages / abstractions; Cloud, EC2; Internal Cloud; Real Time; Data Store; adaptors to; PMML Producers, MLlib]
Deep Net Tools
Deep Reinforcement Learning, Q-Learning (David Silver, Google DeepMind)
Think in terms of IoT….
Device agent measures, infers user’s action
Maximizes future reward, recommends to user or system
Deep Reinforcement Learning, Q-Learning (Think about IoT possibilities)
[Figure: deep Q-network diagram, from David Silver, Google DeepMind; recoverable labels: "Use 4 screen shots"; actions: shift left fast, shift left, shift right, shift right fast]
IoT challenge: How to replace game score with IoT score?
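The core Q-learning update behind the slides above can be sketched with a small table. This is the standard tabular rule, shown as an illustration only: deep Q-learning replaces the table with a neural net that reads the screen shots, and the state count, learning rate, discount factor and reward here are assumptions. The reward is where a game score (or an "IoT score") would plug in.

```python
import numpy as np

ACTIONS = ["shift left fast", "shift left", "shift right", "shift right fast"]

def q_update(Q, state, action, reward, next_state, alpha=0.1, gamma=0.9):
    """Move Q(state, action) toward reward + discounted best future value."""
    target = reward + gamma * np.max(Q[next_state])
    Q[state, action] += alpha * (target - Q[state, action])
    return Q

def best_action(Q, state):
    """Recommend the action with the highest estimated future reward."""
    return ACTIONS[int(np.argmax(Q[state]))]

Q = np.zeros((3, len(ACTIONS)))      # 3 hypothetical states x 4 actions

# One observed transition: in state 0, "shift right" earned a reward of 1.0.
Q = q_update(Q, state=0, action=2, reward=1.0, next_state=1)

print(Q[0], best_action(Q, 0))
```

Repeating this update over many transitions is what lets the agent maximize future reward rather than only the immediate one.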
Deep Reinforcement Learning, Q-Learning (David Silver, Google DeepMind)
Games w/ best Q-learning: Video Pinball, Star Gunner, Crazy Climber
Continuous Space Word Models (word2vec)
•  Before (a predictive “Bag of Words” model):
   –  One row per document, paragraph or web page
   –  Binary word space: 10k to 200k columns, one per word or phrase
      0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 1 ….  “This word space model is ….”
   –  The “Bag of words model” relates input record to a target category
•  New:
   –  One row per word (word2vec), possibly per sentence (sent2vec)
   –  Continuous word space: 100 to 300 columns, continuous values
      .01  .05  .02  .00  .00  .68  .01  .01  .35  ...  .00  →  “King”
      .00  .00  .05  .01  .49  .52  .00  .11  .84  ...  .01  →  “Queen”
   –  The deep net training resulted in an Emergent Property:
      •  Numeric geometry location relates to concept space
      •  “King” – “man” + “woman” = “Queen” (math to change gender relation)
      •  “Washington DC” – “USA” + “England” = “London” (math for capital relation)
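The vector arithmetic above can be tried on hand-made toy vectors. A minimal sketch, assuming made-up 4-dimensional vectors whose columns loosely mean royalty, male, female and fruit; real word2vec vectors have 100 to 300 learned columns with no such clean interpretation, and the function names are illustrative.

```python
import numpy as np

# Made-up vectors; columns: royalty, male, female, fruit.
vectors = {
    "king":  np.array([1.0, 1.0, 0.0, 0.0]),
    "queen": np.array([1.0, 0.0, 1.0, 0.0]),
    "man":   np.array([0.0, 1.0, 0.0, 0.0]),
    "woman": np.array([0.0, 0.0, 1.0, 0.0]),
    "apple": np.array([0.0, 0.0, 0.0, 1.0]),
}

def cosine(u, v):
    """Cosine similarity between two word vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def analogy(a, b, c):
    """Word closest to vec(a) - vec(b) + vec(c), excluding the input words."""
    target = vectors[a] - vectors[b] + vectors[c]
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(target, vectors[w]))

result = analogy("king", "man", "woman")
print(result)
```

With learned vectors the match is approximate, so the answer is taken as the nearest neighbour by cosine similarity rather than an exact equality.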
Continuous Space Word Models (word2vec)
How to SCALE to larger vocabularies?
Training Continuous Space Word Models
•  How to Train These Models?
   –  Raw data: “This example sentence shows the word2vec model training.”
   –  Training data (with target values underscored, and other words as input)
      “This example _sentence_ shows word2vec” (prune “the”)
      “example sentence _shows_ word2vec model”
      “sentence shows _word2vec_ model training”
   –  The context of the 2 to 5 prior and following words predict the middle
   –  Deep Net model architecture, data compression to 300 continuous nodes
      •  50k binary word input vector → ... → 300 → ... → 50k word target vector
•  Use Pre-Trained Models https://
   –  Trained on 100 billion words from Google News
   –  300 dim vectors for 3 million words and phrases
   –  https://
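The way training rows are cut from raw text, as described above, can be sketched in plain Python. This is an illustration of the windowing step only, with a hypothetical function name and a one-word stop list; it uses 2 words of context on each side, the low end of the 2 to 5 range, and does not train a model.

```python
def context_target_pairs(sentence, window=2, stop_words=("the",)):
    """Slide a window over the sentence; the middle word is the target."""
    words = [w.strip(".,") for w in sentence.split()]
    words = [w for w in words if w.lower() not in stop_words]   # prune "the"
    pairs = []
    for i in range(window, len(words) - window):
        context = words[i - window:i] + words[i + 1:i + window + 1]
        pairs.append((context, words[i]))
    return pairs

raw = "This example sentence shows the word2vec model training."
pairs = context_target_pairs(raw)
for context, target in pairs:
    print(context, "->", target)
```

Each pair would then be encoded as binary word vectors (context words in, target word out) before being fed to the net with the 300-node compressed middle.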
Training Continuous Space Word Models
Applying Continuous Space Word Models
State of the art in machine translation
Sequence to Sequence Learning with Neural Networks, NIPS 2014
Language translation
Document summary
Generate text captions for pictures
“Greg’s Guts” on Deep Learning
•  Some claim the need for preprocessing and knowledge representation has ended
   –  For most of the signal processing applications → yes, simplify
   –  I am VERY READY TO COMPETE in other applications, continuing
      •  expressing explicit domain knowledge
      •  optimizing business value calculations
•  Deep Learning gets big advantages from big data
   –  Why? Better populating high dimensional space combination subsets
   –  Unsupervised feature extraction reduces need for large labeled data
•  However, “regular sized data” gets a big boost as well
   –  The “ratio of free parameters” (i.e. neurons) to training set records
   –  For regressions or regular nets, want 5-10 times as many records
   –  Regularization and weight drop out reduce this pressure
   –  Especially when only training “the next auto encoding layer”
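The 5-10 times rule of thumb above can be turned into quick arithmetic. A minimal sketch, assuming that free parameters are counted as weights plus one bias per node (the slide speaks loosely of "neurons"); the function names are illustrative, and the layer sizes are the 256 → 120 → 1 example net from earlier in the talk.

```python
def count_free_parameters(layer_sizes):
    """Weights plus one bias per node, for a fully connected net."""
    return sum((n_in + 1) * n_out
               for n_in, n_out in zip(layer_sizes, layer_sizes[1:]))

def records_wanted(layer_sizes, low=5, high=10):
    """Training records suggested by the 5-10x rule of thumb."""
    n_params = count_free_parameters(layer_sizes)
    return low * n_params, high * n_params

params = count_free_parameters([256, 120, 1])
low, high = records_wanted([256, 120, 1])
print(params, low, high)
# prints: 30961 154805 309610
```

Training only one auto-encoding layer at a time keeps the number of parameters being fitted in any one stage smaller, which is why staged training eases the pressure.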
Deep Learning Summary – IT’S EXCITING!
•  Discussed Deep Learning architectures
   –  Auto Encoder, convolutional, reinforcement learning, continuous word
•  Real Time speed up
   –  Train model, reduce complexity, retrain
   –  Simplify preprocessing with lookup tables
   –  Use cloud computing, do not be limited to device computing
   –  Lambda architecture like Kamanja, to combine real time and batch
•  Applications
   –  Signal Data: IoT, Speech, Images
   –  Control System models (like Atari game playing, IoT)
   –  Language Models

Using Deep Learning to do Real-Time Scoring in Practical Applications - 2015-12-14