Stochastic	Gradient	Descent	
algorithm	&	its	tuning	
	
Mohamed,	Qadri	
	
	
	
SUMMARY	
	
This paper discusses optimization algorithms used for big data applications. We start by explaining the gradient descent algorithm and its limitations. We then delve into the stochastic gradient descent algorithm and explore methods to improve it by adjusting learning rates.
	
GRADIENT	DESCENT		
Gradient	 descent	 is	 a	 first	 order	
optimization	algorithm.	To	find	a	local	minimum	of	a	
function	 using	 gradient	 descent,	 one	 takes	 steps	
proportional	to	the	negative	of	the	gradient	(or	of	the	
approximate	gradient)	of	the	function	at	the	current	
point.	 If	 instead	 one	 takes	 steps	 proportional	 to	
the	positive	of	the	gradient,	one	approaches	a	local	
maximum	 of	 that	 function;	 the	 procedure	 is	 then	
known	as	gradient	ascent.	
Gradient	descent	is	also	known	as	steepest	descent,	
or	the	method	of	steepest	descent.	Gradient	descent	
should	not	be	confused	with	the	method	of	steepest	
descent	for	approximating	integrals.	
	
Using the Gradient Descent (GD) optimization algorithm, the weights are updated incrementally after each epoch (= one pass over the training dataset).
The	 cost	 function	 J(⋅),	 the	 sum	 of	 squared	 errors	
(SSE),	can	be	written	as:	
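J(w) = ½ Σi (y(i) − ŷ(i))²
where y(i) is the target value and ŷ(i) the model output for training sample i (the standard SSE form assumed here).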
	
The magnitude and direction of the weight update are computed by taking a step in the opposite direction of the cost gradient,
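Δw = −η∇J(w) (the standard form for the SSE cost above),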
	
where	 η	 is	 the	 learning	 rate.	 The	 weights	 are	 then	
updated	 after	 each	 epoch	 via	 the	 following	 update	
rule:	
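w := w + Δw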
	
where	Δw	is	a	vector	that	contains	the	weight	updates	
of	each	weight	coefficient	w,	which	are	computed	as	
follows:	
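Δwj = −η ∂J/∂wj = η Σi (y(i) − ŷ(i)) xj(i)
(assuming the linear model and SSE cost stated above).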
	
Essentially,	we	can	picture	GD	optimization	as	a	hiker	
(the	weight	coefficient)	who	wants	to	climb	down	a	
mountain	 (cost	 function)	 into	 a	 valley	 (cost	
minimum),	 and	 each	 step	 is	 determined	 by	 the	
steepness	of	the	slope	(gradient)	and	the	leg	length	of	
the hiker (learning rate). Considering a cost function with only a single weight coefficient, this concept can be pictured as a ball rolling down a one-dimensional, bowl-shaped cost curve towards its minimum.
GRADIENT	DESCENT	VARIANTS	
There	are	three	variants	of	gradient	descent,	which	
differ	 in	 how	 much	 data	 we	 use	 to	 compute	 the	
gradient	of	the	objective	function.	Depending	on	the	
amount	of	data,	we	make	a	trade-off	between	the	
accuracy	 of	 the	 parameter	 update	 and	 the	 time	 it	
takes	to	perform	an	update.
Batch	Gradient	Descent	
Vanilla gradient descent, aka batch gradient descent, computes the gradient of the cost function w.r.t. the parameters θ for the entire training dataset:
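θ = θ − η⋅∇θJ(θ)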
	
As	we	need	to	calculate	the	gradients	for	the	whole	
dataset	to	perform	just	one	update,	batch	gradient	
descent	 can	 be	 very	 slow	 and	 is	 intractable	 for	
datasets	 that	 don't	 fit	 in	 memory.	 Batch	 gradient	
descent	 also	 doesn't	 allow	 us	 to	 update	 our	
model	online,	i.e.	with	new	examples	on-the-fly.	
In	code,	batch	gradient	descent	looks	something	like	
this:	
for i in range(nb_epochs):
    # gradient of the loss over the whole dataset w.r.t. params
    params_grad = evaluate_gradient(loss_function, data, params)
    params = params - learning_rate * params_grad
For a pre-defined number of epochs, we first compute the gradient vector params_grad of the loss function for the whole dataset w.r.t. our parameter vector params. We then update our parameters in the opposite direction of the gradients, with the learning rate determining how big of an update we perform. Batch gradient descent is guaranteed to
converge	 to	 the	 global	 minimum	 for	 convex	 error	
surfaces	 and	 to	 a	 local	 minimum	 for	 non-convex	
surfaces.	
Stochastic	Gradient	Descent	
Stochastic gradient descent (SGD), in contrast, performs a parameter update for each training example x(i) and label y(i):
θ = θ − η⋅∇θJ(θ; x(i); y(i))
Batch	 gradient	 descent	 performs	 redundant	
computations	 for	 large	 datasets,	 as	 it	 recomputes	
gradients	for	similar	examples	before	each	parameter	
update.	 SGD	 does	 away	 with	 this	 redundancy	 by	
performing	 one	 update	 at	 a	 time.	 It	 is	 therefore	
usually	 much	 faster	 and	 can	 also	 be	 used	 to	 learn	
online.		 	
SGD performs frequent updates with a high variance that causes the objective function to fluctuate heavily.
While	 batch	 gradient	 descent	 converges	 to	 the	
minimum	of	the	basin	the	parameters	are	placed	in,	
SGD's	fluctuation,	on	the	one	hand,	enables	it	to	jump	
to	new	and	potentially	better	local	minima.	On	the	
other	hand,	this	ultimately	complicates	convergence	
to	the	exact	minimum,	as	SGD	will	keep	overshooting.	
However,	 it	 has	 been	 shown	 that	 when	 we	 slowly	
decrease	 the	 learning	 rate,	 SGD	 shows	 the	 same	
convergence	 behavior	 as	 batch	 gradient	 descent,	
almost	certainly	converging	to	a	local	or	the	global	
minimum	 for	 non-convex	 and	 convex	 optimization	
respectively.		 	
Its	code	fragment	simply	adds	a	loop	over	the	training	
examples	 and	 evaluates	 the	 gradient	 w.r.t.	 each	
example.	
for i in range(nb_epochs):
    np.random.shuffle(data)
    for example in data:
        # gradient of the loss for a single example w.r.t. params
        params_grad = evaluate_gradient(loss_function, example, params)
        params = params - learning_rate * params_grad
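To make this concrete, here is a minimal, self-contained sketch of SGD for a one-variable linear regression in NumPy; the toy data, the model y ≈ w·x + b, and all constants are illustrative assumptions, not part of the original paper:

import numpy as np

# Toy data for illustration: y = 3x + 2 plus a little noise.
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=200)
y = 3.0 * X + 2.0 + 0.1 * rng.normal(size=200)

w, b = 0.0, 0.0            # model parameters
learning_rate = 0.1
nb_epochs = 20

for epoch in range(nb_epochs):
    for i in rng.permutation(len(X)):      # shuffle each epoch
        error = (w * X[i] + b) - y[i]      # prediction error for one sample
        w -= learning_rate * error * X[i]  # gradient of 0.5*error**2 w.r.t. w
        b -= learning_rate * error         # gradient of 0.5*error**2 w.r.t. b

print(w, b)                                # approaches 3 and 2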
	
In	both	gradient	descent	(GD)	and	stochastic	gradient	
descent	(SGD),	you	update	a	set	of	parameters	in	an	
iterative	manner	to	minimize	an	error	function.	 	
	
While in GD you have to run through ALL the samples in your training set to do a single update for a parameter in a particular iteration, in SGD you use ONLY ONE training sample from your training set to do the update for a parameter in a particular iteration.
	
Thus, if the number of training samples is large, in fact very large, then using gradient descent may take too long, because in every iteration, when you are updating the values of the parameters, you are running through the complete training set. Using SGD, on the other hand, will be faster because you use only one training sample, and it starts improving itself right away from the first sample.
	
SGD often converges much faster than GD, but the error function is not as well minimized as in the case of GD. In most cases, however, the close approximation of the parameter values that you get with SGD is enough, because the values reach the neighborhood of the optimum and keep oscillating there.
	
There are several different flavors of SGD, all of which can be seen throughout the literature. Let's take a look at the three most common variants:

A)
• randomly shuffle samples in the training set
• for one or more epochs, or until approx. cost minimum is reached:
    • for each training sample i:
        • compute gradients and perform weight updates

B)
• for one or more epochs, or until approx. cost minimum is reached:
    • randomly shuffle samples in the training set
    • for each training sample i:
        • compute gradients and perform weight updates

C)
• for iterations t, or until approx. cost minimum is reached:
    • draw a random sample from the training set
    • compute gradients and perform weight updates
	
In scenario A, we shuffle the training set only one time in the beginning; whereas in scenario B, we
shuffle	the	training	set	after	each	epoch	to	prevent	
repeating	 update	 cycles.	 In	 both	 scenario	 A	 and	
scenario	B,	each	training	sample	is	only	used	once	per	
epoch	 to	 update	 the	 model	 weights.	
In	scenario	C,	we	draw	the	training	samples	randomly	
with	replacement	from	the	training	set.	If	the	number	
of	 iterations	 t	 is	 equal	 to	 the	 number	 of	 training	
samples,	we	learn	the	model	based	on	a	bootstrap	
sample	of	the	training	set.	 	
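As an illustrative sketch of how the sampling differs between scenarios B and C (the update(sample) helper is hypothetical and stands in for the gradient computation and weight update):

# Scenario B: shuffle once per epoch; every sample is used exactly once per epoch.
for epoch in range(nb_epochs):
    np.random.shuffle(data)
    for sample in data:
        update(sample)

# Scenario C: draw samples with replacement; with t = len(data) iterations
# the model is trained on a bootstrap sample of the training set.
for t in range(len(data)):
    sample = data[np.random.randint(len(data))]
    update(sample)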
	
In the gradient descent method, one computes the direction that decreases the objective function the most (in the case of minimization problems). But computing this exact direction can be quite costly. In most machine learning problems, for example, the objective function is the cumulative sum of the error over the training examples, and the training set may be very large, so computing the actual gradient would be computationally expensive.
	
In the stochastic gradient descent method, we compute an estimate or approximation of this direction. The simplest way is to look at just one training example (or a subset of training examples) and compute the direction to move based only on this approximation. It is called stochastic because the approximate direction computed at every step can be thought of as a random variable of a stochastic process; this view is mainly used in proving the convergence of the algorithm.
Recent	 theoretical	 results,	 however,	 show	 that	 the	
runtime	to	get	some	desired	optimization	accuracy	
does	not	increase	as	the	training	set	size	increases.	
	
Stochastic	 Gradient	 Descent	 is	 sensitive	 to	 feature	
scaling,	 so	 it	 is	 highly	 recommended	 to	 scale	 your	
data.	For	example,	scale	each	attribute	on	the	input	
vector	X	to	[0,1]	or	[-1,+1],	or	standardize	it	to	have	
mean	0	and	variance	1.	Note	that	the	same	scaling	
must	 be	 applied	 to	 the	 test	 vector	 to	 obtain	
meaningful	results.		
Empirically,	 we	 found	 that	 SGD	 converges	 after	
observing	 approx.	 10^6	 training	 samples.	 Thus,	 a	
reasonable	 first	 guess	 for	 the	 number	 of	 iterations	
is	n_iter	=	np.ceil(10**6	/	n),	where	n	is	the	size	of	the	
training	set.	
If you apply SGD to features extracted using PCA, we found that it is often wise to scale the feature values by some constant c such that the average L2 norm of the training data equals one.
We found that Averaged SGD works best with a larger number of features and a higher eta0.
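As a sketch of these recommendations using scikit-learn (the dataset is synthetic and the hyperparameters illustrative; max_iter is the current name of the n_iter setting mentioned above):

import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=5000, n_features=20, random_state=0)
n_iter = int(np.ceil(10**6 / len(X)))   # rule-of-thumb number of epochs from above

clf = make_pipeline(
    StandardScaler(),                    # scale features to mean 0, variance 1
    SGDClassifier(loss="hinge", penalty="l2", max_iter=n_iter, tol=None),
)
clf.fit(X, y)                            # the same scaler is applied at predict time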
	
Here,	the	term	"stochastic"	comes	from	the	fact	that	
the	gradient	based	on	a	single	training	sample	is	a	
"stochastic	 approximation"	 of	 the	 "true"	 cost
gradient.	 Due	 to	 its	 stochastic	 nature,	 the	 path	
towards	the	global	cost	minimum	is	not	"direct"	as	in	
GD,	but	may	go	"zig-zag"	if	we	are	visualizing	the	cost	
surface	in	a	2D	space.	However,	it	has	been	shown	
that	SGD	almost	surely	converges	to	the	global	cost	
minimum	if	the	cost	function	is	convex	(or	pseudo-
convex).		
	
	
There may be many reasons, but one reason SGD is preferred in machine learning is that it helps the algorithm skip some local minima. Though this is not a theoretically sound argument in my opinion, the optimal points computed using SGD are often empirically better than those found by the GD method.
	
SGD	 is	 just	 one	 type	 of	 online	 learning	
algorithm.	 	 There	 are	 many	 other	 online	 learning	
algorithms that may not depend on gradients (e.g. the perceptron algorithm, Bayesian inference, etc.).
Mini-Batch	Gradient	Descent	
Mini-batch	gradient	descent	finally	takes	the	best	of	
both	worlds	and	performs	an	update	for	every	mini-
batch	of	n	training	examples:
		
θ=θ−η⋅∇θJ(θ;x(i:i+n);y(i:i+n))
	
	
This	way,	it	a)	reduces	the	variance	of	the	parameter	
updates,	which	can	lead	to	more	stable	convergence;	
and	 b)	 can	 make	 use	 of	 highly	 optimized	 matrix	
optimizations	 common	 to	 state-of-the-art	 deep	
learning	libraries	that	make	computing	the	gradient	
w.r.t.	a	mini-batch	very	efficient.	Common	mini-batch	
sizes	 range	 between	 50	 and	 256,	 but	 can	 vary	 for	
different	applications.	Mini-batch	gradient	descent	is	
typically	 the	 algorithm	 of	 choice	 when	 training	 a	
neural	network	and	the	term	SGD	usually	is	employed	
also	 when	 mini-batches	 are	 used.	 Note:	 In	
modifications of SGD in the rest of this paper, we leave out the parameters x(i:i+n); y(i:i+n) for simplicity.
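In code, mini-batch gradient descent looks similar to the SGD fragment above, except that we iterate over small batches; get_batches here is a hypothetical helper that yields consecutive slices of 50 shuffled examples:

for i in range(nb_epochs):
    np.random.shuffle(data)
    for batch in get_batches(data, batch_size=50):
        params_grad = evaluate_gradient(loss_function, batch, params)
        params = params - learning_rate * params_grad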
Challenges	
Vanilla mini-batch gradient descent, however, does not guarantee good convergence and poses a few challenges that need to be addressed:
Choosing	 a	 proper	 learning	 rate	 can	 be	 difficult.	 A	
learning	rate	that	is	too	small	leads	to	painfully	slow	
convergence,	while	a	learning	rate	that	is	too	large	
can	hinder	convergence	and	cause	the	loss	function	
to	fluctuate	around	the	minimum	or	even	to	diverge.	
Learning	rate	schedules	try	to	adjust	the	learning	rate	
during	 training	 by	 e.g.	 annealing,	 i.e.	 reducing	 the	
learning	rate	according	to	a	pre-defined	schedule	or	
when	the	change	in	objective	between	epochs	falls	
below	a	threshold.	These	schedules	and	thresholds,	
however,	have	to	be	defined	in	advance	and	are	thus	
unable	to	adapt	to	a	dataset's	characteristics.	
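A minimal sketch of such a pre-defined schedule, here a simple step decay with illustrative constants:

initial_lr, drop, epochs_per_drop = 0.1, 0.5, 10
for epoch in range(nb_epochs):
    learning_rate = initial_lr * drop ** (epoch // epochs_per_drop)  # halve every 10 epochs
    # ... run the mini-batch updates for this epoch with learning_rate ...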
Additionally,	 the	 same	 learning	 rate	 applies	 to	 all	
parameter	 updates.	 If	 our	 data	 is	 sparse	 and	 our	
features	 have	 very	 different	 frequencies,	 we	 might	
not	want	to	update	all	of	them	to	the	same	extent,	
but	 perform	 a	 larger	 update	 for	 rarely	 occurring	
features.	
Another	 key	 challenge	 of	 minimizing	 highly	 non-
convex	error	functions	common	for	neural	networks	
is	 avoiding	 getting	 trapped	 in	 their	 numerous	
suboptimal	local	minima.	Some	scientists	argue	that	
the	difficulty	arises	in	fact	not	from	local	minima	but	
from	saddle	points,	i.e.	points	where	one	dimension	
slopes	 up	 and	 another	 slopes	 down.	 These	 saddle	
points	 are	 usually	 surrounded	 by	 a	 plateau	 of	 the	
same	error,	which	makes	it	notoriously	hard	for	SGD	
to	 escape,	 as	 the	 gradient	 is	 close	 to	 zero	 in	 all	
dimensions.	
There	are	some	algorithms	that	are	widely	used	by	
the	deep	learning	community	to	deal	with	the	
aforementioned	challenges.	Below	are	some	of	
them.	
	
Momentum	
SGD	has	trouble	navigating	ravines,	i.e.	areas	where	
the	 surface	 curves	 much	 more	 steeply	 in	 one	
dimension	 than	 in	 another,	 which	 are	 common	
around	 local	 optima.	 In	 these	 scenarios,	 SGD	
oscillates	across	the	slopes	of	the	ravine	while	only	
making	hesitant	progress	along	the	bottom	towards	
the	local	optimum	as	in	Image	2.
Image 2: SGD without momentum

Image 3: SGD with momentum
Momentum	is	a	method	that	helps	accelerate	SGD	in	
the	 relevant	 direction	 and	 dampens	 oscillations	 as	
can	 be	 seen	 in	 Image	 3.	 It	 does	 this	 by	 adding	 a	
fraction	γ	of	the	update	vector	of	the	past	time	step	
to	the	current	update	vector:	
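vt = γ·vt−1 + η·∇θJ(θ)
θ = θ − vt
The momentum term γ is usually set to a value of around 0.9.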
Essentially,	when	using	momentum,	we	push	a	ball	
down	 a	 hill.	 The	 ball	 accumulates	 momentum	 as	 it	
rolls	downhill,	becoming	faster	and	faster	on	the	way	
(until	 it	 reaches	 its	 terminal	 velocity	 if	 there	 is	 air	
resistance,	i.e.	γ<1).	The	same	thing	happens	to	our	
parameter	updates:	The	momentum	term	increases	
for	 dimensions	 whose	 gradients	 point	 in	 the	 same	
directions	and	reduces	updates	for	dimensions	whose	
gradients	 change	 directions.	 As	 a	 result,	 we	 gain	
faster	convergence	and	reduced	oscillation.	
Nesterov	Accelerated	Gradient	
However,	a	ball	that	rolls	down	a	hill,	blindly	following	
the	slope,	is	highly	unsatisfactory.	We'd	like	to	have	a	
smarter	ball,	a	ball	that	has	a	notion	of	where	it	is	
going	so	that	it	knows	to	slow	down	before	the	hill	
slopes	up	again.	
Nesterov	accelerated	gradient	(NAG)	is	a	way	to	give	
our	 momentum	 term	 this	 kind	 of	 prescience.	 We	
know	that	we	will	use	our	momentum	term	γvt−1	to	
move	the	parameters	θ.	Computing	θ−γvt−1	thus	gives	
us	 an	 approximation	 of	 the	 next	 position	 of	 the	
parameters	 (the	 gradient	 is	 missing	 for	 the	 full	
update),	 a	 rough	 idea	 where	 our	 parameters	 are	
going to be. We can now effectively look ahead by calculating the gradient not w.r.t. our current parameters θ but w.r.t. the approximate future position of our parameters:
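vt = γ·vt−1 + η·∇θJ(θ − γ·vt−1)
θ = θ − vt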
Again,	we	set	the	momentum	term	γ	to	a	value	of	
around	 0.9.	 While	 Momentum	 first	 computes	 the	
current	gradient	(small	blue	vector	in	Image	4)	and	
then	takes	a	big	jump	in	the	direction	of	the	updated	
accumulated	 gradient	 (big	 blue	 vector),	 NAG	 first	
makes	 a	 big	 jump	 in	 the	 direction	 of	 the	 previous	
accumulated	gradient	(brown	vector),	measures	the	
gradient	and	then	makes	a	correction	(green	vector).	
This	anticipatory	update	prevents	us	from	going	too	
fast	 and	 results	 in	 increased	 responsiveness,	 which	
has	significantly	increased	the	performance	of	RNNs	
on	a	number	of	tasks.	
Image	4	
	
	
	
Adagrad	
Adagrad	 is	 an	 algorithm	 for	 gradient-based	
optimization	that	does	just	this:	It	adapts	the	learning	
rate	to	the	parameters,	performing	larger	updates	for	
infrequent	 and	 smaller	 updates	 for	 frequent	
parameters.	 For	 this	 reason,	 it	 is	 well-suited	 for	
dealing with sparse data. Adagrad has greatly improved the robustness of SGD, and it was used for training large-scale neural nets at Google, which -- among other things -- learned to recognize cats in YouTube videos.
Previously,	 we	 performed	 an	 update	 for	 all	
parameters	θ	at	once	as	every	parameter	θi	used	the	
same	 learning	 rate	 η.	 As	 Adagrad	 uses	 a	 different	
learning	 rate	 for	 every	 parameter	 θi	 at	 every	 time	
step	 t,	 we	 first	 show	 Adagrad's	 per-parameter	
update, which we then vectorize. For brevity, we set gt,i to be the gradient of the objective function w.r.t. the parameter θi at time step t:
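gt,i = ∇θJ(θt,i)
Adagrad then modifies the general learning rate η at each time step t for every parameter θi, based on the past squared gradients that have been computed for θi:
θt+1,i = θt,i − (η / √(Gt,ii + ε)) ⋅ gt,i
where Gt is a diagonal matrix whose i,i element is the sum of the squares of the gradients w.r.t. θi up to time step t, and ε is a small smoothing term that avoids division by zero.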
One	of	Adagrad's	main	benefits	is	that	it	eliminates	
the	 need	 to	 manually	 tune	 the	 learning	 rate.	 Most	
implementations	use	a	default	value	of	0.01	and	leave	
it	at	that.	
Adagrad's	main	weakness	is	its	accumulation	of	the	
squared	 gradients	 in	 the	 denominator:	 Since	 every	
added	term	is	positive,	the	accumulated	sum	keeps	
growing	 during	 training.	 This	 in	 turn	 causes	 the
learning	 rate	 to	 shrink	 and	 eventually	 become	
infinitesimally	small,	at	which	point	the	algorithm	is	
no	longer	able	to	acquire	additional	knowledge.	The	
following	algorithms	aim	to	resolve	this	flaw.	
Adadelta	
Adadelta	 is	 an	 extension	 of	 Adagrad	 that	 seeks	 to	
reduce	 its	 aggressive,	 monotonically	 decreasing	
learning	 rate.	 Instead	 of	 accumulating	 all	 past	
squared	gradients,	Adadelta	restricts	the	window	of	
accumulated	past	gradients	to	some	fixed	size	w.	
Instead	 of	 inefficiently	 storing	 w	 previous	 squared	
gradients,	the	sum	of	gradients	is	recursively	defined	
as	a	decaying	average	of	all	past	squared	gradients.	
The running average E[g²]t at time step t then depends (as a fraction γ, similarly to the momentum term) only on the previous average and the current gradient:
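E[g²]t = γ·E[g²]t−1 + (1−γ)·g²t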
RMSprop	
RMSprop is an unpublished, adaptive learning rate method proposed by Geoff Hinton.
RMSprop	 and	 Adadelta	 have	 both	 been	 developed	
independently	around	the	same	time	stemming	from	
the	 need	 to	 resolve	 Adagrad's	 radically	 diminishing	
learning rates. RMSprop is in fact identical to the first update vector of Adadelta:
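E[g²]t = 0.9·E[g²]t−1 + 0.1·g²t
θt+1 = θt − (η / √(E[g²]t + ε)) ⋅ gt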
RMSprop	 as	 well	 divides	 the	 learning	 rate	 by	 an	
exponentially	decaying	average	of	squared	gradients.	
Hinton	 suggests	 γ	 to	 be	 set	 to	 0.9,	 while	 a	 good	
default	value	for	the	learning	rate	η	is	0.001.	
Adam	
Adaptive	 Moment	 Estimation	 (Adam)	 	 is	 another	
method	 that	 computes	 adaptive	 learning	 rates	 for	
each	 parameter.	 In	 addition	 to	 storing	 an	
exponentially	 decaying	 average	 of	 past	 squared	
gradients	vt	like	Adadelta	and	RMSprop,	Adam	also	
keeps	 an	 exponentially	 decaying	 average	 of	 past	
gradients	mt,	similar	to	momentum:	
mt = β1·mt−1 + (1−β1)·gt
vt = β2·vt−1 + (1−β2)·gt²
mt	 and	 vt	 are	 estimates	 of	 the	 first	 moment	 (the	
mean)	 and	 the	 second	 moment	 (the	 uncentered	
variance)	 of	 the	 gradients	 respectively,	 hence	 the	
name	of	the	method.	As	mt	and	vt	are	initialized	as	
vectors	of	0's,	the	authors	of	Adam	observe	that	they	
are	biased	towards	zero,	especially	during	the	initial	
time	steps,	and	especially	when	the	decay	rates	are	
small	(i.e.	β1	and	β2	are	close	to	1).	
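The authors counteract these biases by computing bias-corrected first and second moment estimates,
m̂t = mt / (1 − β1^t)
v̂t = vt / (1 − β2^t)
which yield the Adam update rule:
θt+1 = θt − (η / (√v̂t + ε)) ⋅ m̂t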
	
ADDITIONAL	STRATEGIES	FOR	OPTIMIZING	SGD	
Finally,	we	introduce	additional	strategies	that	can	be	
used	 alongside	 any	 of	 the	 previously	 mentioned	
algorithms	 to	 further	 improve	 the	 performance	 of	
SGD.		
Shuffling	And	Curriculum	Learning	
Generally,	 we	 want	 to	 avoid	 providing	 the	 training	
examples	in	a	meaningful	order	to	our	model	as	this	
may	bias	the	optimization	algorithm.	Consequently,	it	
is	often	a	good	idea	to	shuffle	the	training	data	after	
every	epoch.	
On	the	other	hand,	for	some	cases	where	we	aim	to	
solve	 progressively	 harder	 problems,	 supplying	 the	
training	examples	in	a	meaningful	order	may	actually	
lead	 to	 improved	 performance	 and	 better	
convergence.	 The	 method	 for	 establishing	 this	
meaningful	order	is	called	Curriculum	Learning.	
Batch	Normalization	
To	facilitate	learning,	we	typically	normalize	the	initial	
values	 of	 our	 parameters	 by	 initializing	 them	 with	
zero	mean	and	unit	variance.	As	training	progresses	
and	we	update	parameters	to	different	extents,	we	
lose	 this	 normalization,	 which	 slows	 down	 training	
and	 amplifies	 changes	 as	 the	 network	 becomes	
deeper.	
Batch	 normalization	 reestablishes	 these	
normalizations	for	every	mini-batch	and	changes	are	
back-propagated	 through	 the	 operation	 as	 well.	 By	
making	normalization	part	of	the	model	architecture,
we	are	able	to	use	higher	learning	rates	and	pay	less	
attention	 to	 the	 initialization	 parameters.	 Batch	
normalization	 additionally	 acts	 as	 a	 regularizer,	
reducing	(and	sometimes	even	eliminating)	the	need	
for	Dropout.	
Early	Stopping	
When training, you should always monitor the error on a validation set and stop (with some patience) if your validation error does not improve enough.
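A minimal sketch of this patience-based check; train_one_epoch and validation_error are hypothetical helpers standing in for your training loop and validation pass:

best_err, patience, wait = float("inf"), 5, 0
for epoch in range(nb_epochs):
    train_one_epoch()
    err = validation_error()
    if err < best_err - 1e-4:       # validation error "improves enough"
        best_err, wait = err, 0
    else:
        wait += 1
        if wait >= patience:        # stop after 5 epochs without improvement
            break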
Gradient	Noise	
Adding	noise	makes	networks	more	robust	to	poor	
initialization	and	helps	training	particularly	deep	and	
complex	 networks.	 It	 is	 suspected	 that	 the	 added	
noise	gives	the	model	more	chances	to	escape	and	
find	new	local	minima,	which	are	more	frequent	for	
deeper	models.	
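A sketch of this idea inside the update loop; the noise scale and its decay schedule below are illustrative, not the specific values proposed in the literature:

noise_scale = 0.01 / (1.0 + t) ** 0.55                      # decaying standard deviation (illustrative)
params_grad = params_grad + np.random.normal(0.0, noise_scale, size=params_grad.shape)
params = params - learning_rate * params_grad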
We then investigated algorithms that are most commonly used for optimizing SGD: Momentum,
Nesterov	 accelerated	 gradient,	 Adagrad,	 Adadelta,	
RMSprop,	 Adam,	 as	 well	 as	 different	 algorithms	 to	
optimize	 asynchronous	 SGD.	 Finally,	 we've	
considered	other	strategies	to	improve	SGD	such	as	
shuffling	 and	 curriculum	 learning,	 batch	
normalization,	and	early	stopping.	
SGD	has	been	successfully	applied	to	large-scale	and	
sparse	machine	learning	problems	often	encountered	
in	text	classification	and	natural	language	processing.	
Given that the data is sparse, the classifiers in scikit-learn's SGD module easily scale to problems with more than 10^5 training examples and more than 10^5 features.
Advantages	of	Stochastic	Gradient	Descent	
• Efficiency.	
• Ease	of	implementation	(lots	of	opportunities	for	
code	tuning).	
Disadvantages	of	Stochastic	Gradient	Descent	
• SGD	requires	a	number	of	hyperparameters	such	
as	the	regularization	parameter	and	the	number	
of	iterations.	
• SGD	is	sensitive	to	feature	scaling.	
In scikit-learn, the class SGDClassifier implements a plain stochastic gradient descent learning routine which supports different loss functions and penalties for classification. The class SGDRegressor implements a plain stochastic gradient descent learning routine which supports different loss functions and penalties to fit linear regression models. SGDRegressor is well suited for regression problems with a large number of training samples (> 10,000); for other problems we recommend Ridge, Lasso, or ElasticNet.
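A minimal usage sketch of SGDRegressor on synthetic data (the dataset and settings are illustrative):

from sklearn.datasets import make_regression
from sklearn.linear_model import SGDRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_regression(n_samples=20000, n_features=10, noise=0.1, random_state=0)
reg = make_pipeline(StandardScaler(), SGDRegressor(penalty="l2", max_iter=1000, tol=1e-3))
reg.fit(X, y)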
The major advantage of SGD is its efficiency, which is basically linear in the number of training examples. If X is a matrix of size (n, p), training has a cost of O(k·n·p̄), where k is the number of iterations (epochs) and p̄ is the average number of non-zero attributes per sample.
Application	
SGD algorithms can therefore be used in scenarios where it is expensive to iterate over the entire dataset several times. The algorithm is ideal for streaming analytics, where we can discard the data after processing it. We can use modified versions of SGD, such as mini-batch GD, with adjusted learning rates to get better results, as seen above. SGD can be used to minimize the cost function of several classification and regression techniques.
SGD in a real-time scenario learns continuously as the data comes in, somewhat like reinforcement learning with feedback. An example is a real-time system that judges, on the basis of some parameters, whether a customer is genuine or not. The first set of parameters is learnt from historical data. As a new customer comes in with a new set of features, it falls into one of two groups: genuine or not genuine. Based on this completed classification, the prior parameters of the system are updated using the features that the customer brought in. Over time the system becomes much smarter than what it started with. Apple's Siri and Microsoft's Cortana assistants are also examples of real-time learning, correcting themselves on the go.
Conclusion	
	
In conclusion, we saw that to manage large-scale data we can use the variants of gradient descent: batch gradient descent, stochastic gradient descent (SGD), and mini-batch gradient descent, each an improvement over the previous one. They differ in how much data we use to compute the gradient of the objective function. Given the ubiquity of large-scale data solutions and the availability of commodity clusters, distributing SGD to speed it up further is an obvious choice. SGD by itself is inherently sequential: step by step, we progress further towards the minimum. Running it provides good convergence but can be slow, particularly on large datasets. In contrast, running SGD asynchronously is faster, but suboptimal communication between workers can lead to poor convergence. Additionally, we can also parallelize SGD on one machine without the need for a large computing cluster.
	
Depending on the amount of data, we make a trade-off between the accuracy of the parameter update and the time it takes to perform an update. To overcome the challenge of tuning the learning rate for parameter updates, we looked at various methods that address this problem: momentum, Nesterov accelerated gradient, Adagrad, Adadelta, RMSprop, and Adam. In summary, RMSprop is an extension of Adagrad that deals with its radically diminishing learning rates. It is identical to Adadelta, except that Adadelta uses the RMS of parameter updates in the numerator of its update rule. Adam, finally, adds bias-correction and momentum to RMSprop. RMSprop, Adadelta, and Adam are thus very similar algorithms that do well in similar circumstances. Its bias-correction helps Adam slightly outperform RMSprop towards the end of optimization as gradients become sparser. Overall, Adam might be the best choice.
	
Interestingly, many recent papers use vanilla SGD without momentum and a simple learning rate annealing schedule. As has been shown, SGD usually manages to find a minimum, but it might take significantly longer than with some of the other optimizers, is much more reliant on a robust initialization and annealing schedule, and may get stuck in saddle points rather than local minima. Consequently, if you care about fast convergence and train a deep or complex neural network, you should choose one of the adaptive learning rate methods.

More Related Content

What's hot

Stochastic Gradient Decent (SGD).pptx
Stochastic Gradient Decent (SGD).pptxStochastic Gradient Decent (SGD).pptx
Stochastic Gradient Decent (SGD).pptxShubham Jaybhaye
 
Activation function
Activation functionActivation function
Activation functionAstha Jain
 
Overview on Optimization algorithms in Deep Learning
Overview on Optimization algorithms in Deep LearningOverview on Optimization algorithms in Deep Learning
Overview on Optimization algorithms in Deep LearningKhang Pham
 
Reinforcement Learning Tutorial | Edureka
Reinforcement Learning Tutorial | EdurekaReinforcement Learning Tutorial | Edureka
Reinforcement Learning Tutorial | EdurekaEdureka!
 
Gradient descent optimizer
Gradient descent optimizerGradient descent optimizer
Gradient descent optimizerHojin Yang
 
Feed forward ,back propagation,gradient descent
Feed forward ,back propagation,gradient descentFeed forward ,back propagation,gradient descent
Feed forward ,back propagation,gradient descentMuhammad Rasel
 
Support Vector Machines
Support Vector MachinesSupport Vector Machines
Support Vector Machinesnextlib
 
Optimization in deep learning
Optimization in deep learningOptimization in deep learning
Optimization in deep learningRakshith Sathish
 
Perceptron (neural network)
Perceptron (neural network)Perceptron (neural network)
Perceptron (neural network)EdutechLearners
 
An introduction to reinforcement learning
An introduction to  reinforcement learningAn introduction to  reinforcement learning
An introduction to reinforcement learningJie-Han Chen
 
Activation functions
Activation functionsActivation functions
Activation functionsPRATEEK SAHU
 
Gradient descent method
Gradient descent methodGradient descent method
Gradient descent methodSanghyuk Chun
 
Basic Generative Adversarial Networks
Basic Generative Adversarial NetworksBasic Generative Adversarial Networks
Basic Generative Adversarial NetworksDong Heon Cho
 
Decision trees in Machine Learning
Decision trees in Machine Learning Decision trees in Machine Learning
Decision trees in Machine Learning Mohammad Junaid Khan
 
Support Vector Machines for Classification
Support Vector Machines for ClassificationSupport Vector Machines for Classification
Support Vector Machines for ClassificationPrakash Pimpale
 
Naïve Bayes Classifier Algorithm.pptx
Naïve Bayes Classifier Algorithm.pptxNaïve Bayes Classifier Algorithm.pptx
Naïve Bayes Classifier Algorithm.pptxShubham Jaybhaye
 
07 regularization
07 regularization07 regularization
07 regularizationRonald Teo
 
Classification Based Machine Learning Algorithms
Classification Based Machine Learning AlgorithmsClassification Based Machine Learning Algorithms
Classification Based Machine Learning AlgorithmsMd. Main Uddin Rony
 
Gradient Descent, Back Propagation, and Auto Differentiation - Advanced Spark...
Gradient Descent, Back Propagation, and Auto Differentiation - Advanced Spark...Gradient Descent, Back Propagation, and Auto Differentiation - Advanced Spark...
Gradient Descent, Back Propagation, and Auto Differentiation - Advanced Spark...Chris Fregly
 

What's hot (20)

Stochastic Gradient Decent (SGD).pptx
Stochastic Gradient Decent (SGD).pptxStochastic Gradient Decent (SGD).pptx
Stochastic Gradient Decent (SGD).pptx
 
Gradient descent method
Gradient descent methodGradient descent method
Gradient descent method
 
Activation function
Activation functionActivation function
Activation function
 
Overview on Optimization algorithms in Deep Learning
Overview on Optimization algorithms in Deep LearningOverview on Optimization algorithms in Deep Learning
Overview on Optimization algorithms in Deep Learning
 
Reinforcement Learning Tutorial | Edureka
Reinforcement Learning Tutorial | EdurekaReinforcement Learning Tutorial | Edureka
Reinforcement Learning Tutorial | Edureka
 
Gradient descent optimizer
Gradient descent optimizerGradient descent optimizer
Gradient descent optimizer
 
Feed forward ,back propagation,gradient descent
Feed forward ,back propagation,gradient descentFeed forward ,back propagation,gradient descent
Feed forward ,back propagation,gradient descent
 
Support Vector Machines
Support Vector MachinesSupport Vector Machines
Support Vector Machines
 
Optimization in deep learning
Optimization in deep learningOptimization in deep learning
Optimization in deep learning
 
Perceptron (neural network)
Perceptron (neural network)Perceptron (neural network)
Perceptron (neural network)
 
An introduction to reinforcement learning
An introduction to  reinforcement learningAn introduction to  reinforcement learning
An introduction to reinforcement learning
 
Activation functions
Activation functionsActivation functions
Activation functions
 
Gradient descent method
Gradient descent methodGradient descent method
Gradient descent method
 
Basic Generative Adversarial Networks
Basic Generative Adversarial NetworksBasic Generative Adversarial Networks
Basic Generative Adversarial Networks
 
Decision trees in Machine Learning
Decision trees in Machine Learning Decision trees in Machine Learning
Decision trees in Machine Learning
 
Support Vector Machines for Classification
Support Vector Machines for ClassificationSupport Vector Machines for Classification
Support Vector Machines for Classification
 
Naïve Bayes Classifier Algorithm.pptx
Naïve Bayes Classifier Algorithm.pptxNaïve Bayes Classifier Algorithm.pptx
Naïve Bayes Classifier Algorithm.pptx
 
07 regularization
07 regularization07 regularization
07 regularization
 
Classification Based Machine Learning Algorithms
Classification Based Machine Learning AlgorithmsClassification Based Machine Learning Algorithms
Classification Based Machine Learning Algorithms
 
Gradient Descent, Back Propagation, and Auto Differentiation - Advanced Spark...
Gradient Descent, Back Propagation, and Auto Differentiation - Advanced Spark...Gradient Descent, Back Propagation, and Auto Differentiation - Advanced Spark...
Gradient Descent, Back Propagation, and Auto Differentiation - Advanced Spark...
 

Viewers also liked

Setting Artificial Neural Networks parameters
Setting Artificial Neural Networks parametersSetting Artificial Neural Networks parameters
Setting Artificial Neural Networks parametersMadhumita Tamhane
 
Fast Distributed Online Classification
Fast Distributed Online ClassificationFast Distributed Online Classification
Fast Distributed Online ClassificationPrasad Chalasani
 
07 logistic regression and stochastic gradient descent
07 logistic regression and stochastic gradient descent07 logistic regression and stochastic gradient descent
07 logistic regression and stochastic gradient descentSubhas Kumar Ghosh
 
Piotr Mirowski - Review Autoencoders (Deep Learning) - CIUUK14
Piotr Mirowski - Review Autoencoders (Deep Learning) - CIUUK14Piotr Mirowski - Review Autoencoders (Deep Learning) - CIUUK14
Piotr Mirowski - Review Autoencoders (Deep Learning) - CIUUK14Daniel Lewis
 
Winning Kaggle 101: Introduction to Stacking
Winning Kaggle 101: Introduction to StackingWinning Kaggle 101: Introduction to Stacking
Winning Kaggle 101: Introduction to StackingTed Xiao
 
Deep Learning - Convolutional Neural Networks
Deep Learning - Convolutional Neural NetworksDeep Learning - Convolutional Neural Networks
Deep Learning - Convolutional Neural NetworksChristian Perone
 

Viewers also liked (8)

Setting Artificial Neural Networks parameters
Setting Artificial Neural Networks parametersSetting Artificial Neural Networks parameters
Setting Artificial Neural Networks parameters
 
Fast Distributed Online Classification
Fast Distributed Online ClassificationFast Distributed Online Classification
Fast Distributed Online Classification
 
07 logistic regression and stochastic gradient descent
07 logistic regression and stochastic gradient descent07 logistic regression and stochastic gradient descent
07 logistic regression and stochastic gradient descent
 
Piotr Mirowski - Review Autoencoders (Deep Learning) - CIUUK14
Piotr Mirowski - Review Autoencoders (Deep Learning) - CIUUK14Piotr Mirowski - Review Autoencoders (Deep Learning) - CIUUK14
Piotr Mirowski - Review Autoencoders (Deep Learning) - CIUUK14
 
Recommender Systems
Recommender SystemsRecommender Systems
Recommender Systems
 
No-Bullshit Data Science
No-Bullshit Data ScienceNo-Bullshit Data Science
No-Bullshit Data Science
 
Winning Kaggle 101: Introduction to Stacking
Winning Kaggle 101: Introduction to StackingWinning Kaggle 101: Introduction to Stacking
Winning Kaggle 101: Introduction to Stacking
 
Deep Learning - Convolutional Neural Networks
Deep Learning - Convolutional Neural NetworksDeep Learning - Convolutional Neural Networks
Deep Learning - Convolutional Neural Networks
 

Similar to Stochastic gradient descent and its tuning

Everything You Wanted to Know About Optimization
Everything You Wanted to Know About OptimizationEverything You Wanted to Know About Optimization
Everything You Wanted to Know About Optimizationindico data
 
4. OPTIMIZATION NN AND FL.pptx
4. OPTIMIZATION NN AND FL.pptx4. OPTIMIZATION NN AND FL.pptx
4. OPTIMIZATION NN AND FL.pptxkumarkaushal17
 
Methods of Optimization in Machine Learning
Methods of Optimization in Machine LearningMethods of Optimization in Machine Learning
Methods of Optimization in Machine LearningKnoldus Inc.
 
GRADIENT OMISSIVE DESCENT IS A MINIMIZATION ALGORITHM
GRADIENT OMISSIVE DESCENT IS A MINIMIZATION ALGORITHMGRADIENT OMISSIVE DESCENT IS A MINIMIZATION ALGORITHM
GRADIENT OMISSIVE DESCENT IS A MINIMIZATION ALGORITHMijscai
 
Deep learning concepts
Deep learning conceptsDeep learning concepts
Deep learning conceptsJoe li
 
Parallel Machine Learning
Parallel Machine LearningParallel Machine Learning
Parallel Machine LearningJanani C
 
Training Deep Networks with Backprop (D1L4 Insight@DCU Machine Learning Works...
Training Deep Networks with Backprop (D1L4 Insight@DCU Machine Learning Works...Training Deep Networks with Backprop (D1L4 Insight@DCU Machine Learning Works...
Training Deep Networks with Backprop (D1L4 Insight@DCU Machine Learning Works...Universitat Politècnica de Catalunya
 
Memory Polynomial Based Adaptive Digital Predistorter
Memory Polynomial Based Adaptive Digital PredistorterMemory Polynomial Based Adaptive Digital Predistorter
Memory Polynomial Based Adaptive Digital PredistorterIJERA Editor
 
Linear Regression Paper Review.pptx
Linear Regression Paper Review.pptxLinear Regression Paper Review.pptx
Linear Regression Paper Review.pptxMurindanyiSudi1
 
Supervised Learning.pdf
Supervised Learning.pdfSupervised Learning.pdf
Supervised Learning.pdfgadissaassefa
 
A Novel Methodology to Implement Optimization Algorithms in Machine Learning
A Novel Methodology to Implement Optimization Algorithms in Machine LearningA Novel Methodology to Implement Optimization Algorithms in Machine Learning
A Novel Methodology to Implement Optimization Algorithms in Machine LearningVenkata Karthik Gullapalli
 
3. Training Artificial Neural Networks.pptx
3. Training Artificial Neural Networks.pptx3. Training Artificial Neural Networks.pptx
3. Training Artificial Neural Networks.pptxmunwar7
 
An overview of gradient descent optimization algorithms.pdf
An overview of gradient descent optimization algorithms.pdfAn overview of gradient descent optimization algorithms.pdf
An overview of gradient descent optimization algorithms.pdfvudinhphuong96
 
Machine Learning.pdf
Machine Learning.pdfMachine Learning.pdf
Machine Learning.pdfBeyaNasr1
 
Gradient Boosted trees
Gradient Boosted treesGradient Boosted trees
Gradient Boosted treesNihar Ranjan
 

Similar to Stochastic gradient descent and its tuning (20)

Everything You Wanted to Know About Optimization
Everything You Wanted to Know About OptimizationEverything You Wanted to Know About Optimization
Everything You Wanted to Know About Optimization
 
4. OPTIMIZATION NN AND FL.pptx
4. OPTIMIZATION NN AND FL.pptx4. OPTIMIZATION NN AND FL.pptx
4. OPTIMIZATION NN AND FL.pptx
 
Methods of Optimization in Machine Learning
Methods of Optimization in Machine LearningMethods of Optimization in Machine Learning
Methods of Optimization in Machine Learning
 
GRADIENT OMISSIVE DESCENT IS A MINIMIZATION ALGORITHM
GRADIENT OMISSIVE DESCENT IS A MINIMIZATION ALGORITHMGRADIENT OMISSIVE DESCENT IS A MINIMIZATION ALGORITHM
GRADIENT OMISSIVE DESCENT IS A MINIMIZATION ALGORITHM
 
Deep learning concepts
Deep learning conceptsDeep learning concepts
Deep learning concepts
 
Regression ppt
Regression pptRegression ppt
Regression ppt
 
Parallel Machine Learning
Parallel Machine LearningParallel Machine Learning
Parallel Machine Learning
 
ngboost.pptx
ngboost.pptxngboost.pptx
ngboost.pptx
 
Training Deep Networks with Backprop (D1L4 Insight@DCU Machine Learning Works...
Training Deep Networks with Backprop (D1L4 Insight@DCU Machine Learning Works...Training Deep Networks with Backprop (D1L4 Insight@DCU Machine Learning Works...
Training Deep Networks with Backprop (D1L4 Insight@DCU Machine Learning Works...
 
Memory Polynomial Based Adaptive Digital Predistorter
Memory Polynomial Based Adaptive Digital PredistorterMemory Polynomial Based Adaptive Digital Predistorter
Memory Polynomial Based Adaptive Digital Predistorter
 
Dnn guidelines
Dnn guidelinesDnn guidelines
Dnn guidelines
 
Linear Regression Paper Review.pptx
Linear Regression Paper Review.pptxLinear Regression Paper Review.pptx
Linear Regression Paper Review.pptx
 
Ann a Algorithms notes
Ann a Algorithms notesAnn a Algorithms notes
Ann a Algorithms notes
 
Supervised Learning.pdf
Supervised Learning.pdfSupervised Learning.pdf
Supervised Learning.pdf
 
A Novel Methodology to Implement Optimization Algorithms in Machine Learning
A Novel Methodology to Implement Optimization Algorithms in Machine LearningA Novel Methodology to Implement Optimization Algorithms in Machine Learning
A Novel Methodology to Implement Optimization Algorithms in Machine Learning
 
3. Training Artificial Neural Networks.pptx
3. Training Artificial Neural Networks.pptx3. Training Artificial Neural Networks.pptx
3. Training Artificial Neural Networks.pptx
 
Chapter 18,19
Chapter 18,19Chapter 18,19
Chapter 18,19
 
An overview of gradient descent optimization algorithms.pdf
An overview of gradient descent optimization algorithms.pdfAn overview of gradient descent optimization algorithms.pdf
An overview of gradient descent optimization algorithms.pdf
 
Machine Learning.pdf
Machine Learning.pdfMachine Learning.pdf
Machine Learning.pdf
 
Gradient Boosted trees
Gradient Boosted treesGradient Boosted trees
Gradient Boosted trees
 

Recently uploaded

Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...amitlee9823
 
VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...
VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...
VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...SUHANI PANDEY
 
➥🔝 7737669865 🔝▻ Sambalpur Call-girls in Women Seeking Men 🔝Sambalpur🔝 Esc...
➥🔝 7737669865 🔝▻ Sambalpur Call-girls in Women Seeking Men  🔝Sambalpur🔝   Esc...➥🔝 7737669865 🔝▻ Sambalpur Call-girls in Women Seeking Men  🔝Sambalpur🔝   Esc...
➥🔝 7737669865 🔝▻ Sambalpur Call-girls in Women Seeking Men 🔝Sambalpur🔝 Esc...amitlee9823
 
Vip Mumbai Call Girls Thane West Call On 9920725232 With Body to body massage...
Vip Mumbai Call Girls Thane West Call On 9920725232 With Body to body massage...Vip Mumbai Call Girls Thane West Call On 9920725232 With Body to body massage...
Vip Mumbai Call Girls Thane West Call On 9920725232 With Body to body massage...amitlee9823
 
➥🔝 7737669865 🔝▻ Dindigul Call-girls in Women Seeking Men 🔝Dindigul🔝 Escor...
➥🔝 7737669865 🔝▻ Dindigul Call-girls in Women Seeking Men  🔝Dindigul🔝   Escor...➥🔝 7737669865 🔝▻ Dindigul Call-girls in Women Seeking Men  🔝Dindigul🔝   Escor...
➥🔝 7737669865 🔝▻ Dindigul Call-girls in Women Seeking Men 🔝Dindigul🔝 Escor...amitlee9823
 
Just Call Vip call girls Bellary Escorts ☎️9352988975 Two shot with one girl ...
Just Call Vip call girls Bellary Escorts ☎️9352988975 Two shot with one girl ...Just Call Vip call girls Bellary Escorts ☎️9352988975 Two shot with one girl ...
Just Call Vip call girls Bellary Escorts ☎️9352988975 Two shot with one girl ...gajnagarg
 
Discover Why Less is More in B2B Research
Discover Why Less is More in B2B ResearchDiscover Why Less is More in B2B Research
Discover Why Less is More in B2B Researchmichael115558
 
Just Call Vip call girls Palakkad Escorts ☎️9352988975 Two shot with one girl...
Just Call Vip call girls Palakkad Escorts ☎️9352988975 Two shot with one girl...Just Call Vip call girls Palakkad Escorts ☎️9352988975 Two shot with one girl...
Just Call Vip call girls Palakkad Escorts ☎️9352988975 Two shot with one girl...gajnagarg
 
Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...
Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...
Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...amitlee9823
 
Aspirational Block Program Block Syaldey District - Almora
Aspirational Block Program Block Syaldey District - AlmoraAspirational Block Program Block Syaldey District - Almora
Aspirational Block Program Block Syaldey District - AlmoraGovindSinghDasila
 
➥🔝 7737669865 🔝▻ Mathura Call-girls in Women Seeking Men 🔝Mathura🔝 Escorts...
➥🔝 7737669865 🔝▻ Mathura Call-girls in Women Seeking Men  🔝Mathura🔝   Escorts...➥🔝 7737669865 🔝▻ Mathura Call-girls in Women Seeking Men  🔝Mathura🔝   Escorts...
➥🔝 7737669865 🔝▻ Mathura Call-girls in Women Seeking Men 🔝Mathura🔝 Escorts...amitlee9823
 
Just Call Vip call girls kakinada Escorts ☎️9352988975 Two shot with one girl...
Just Call Vip call girls kakinada Escorts ☎️9352988975 Two shot with one girl...Just Call Vip call girls kakinada Escorts ☎️9352988975 Two shot with one girl...
Just Call Vip call girls kakinada Escorts ☎️9352988975 Two shot with one girl...gajnagarg
 
➥🔝 7737669865 🔝▻ malwa Call-girls in Women Seeking Men 🔝malwa🔝 Escorts Ser...
➥🔝 7737669865 🔝▻ malwa Call-girls in Women Seeking Men  🔝malwa🔝   Escorts Ser...➥🔝 7737669865 🔝▻ malwa Call-girls in Women Seeking Men  🔝malwa🔝   Escorts Ser...
➥🔝 7737669865 🔝▻ malwa Call-girls in Women Seeking Men 🔝malwa🔝 Escorts Ser...amitlee9823
 
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...amitlee9823
 
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...Valters Lauzums
 
➥🔝 7737669865 🔝▻ mahisagar Call-girls in Women Seeking Men 🔝mahisagar🔝 Esc...
➥🔝 7737669865 🔝▻ mahisagar Call-girls in Women Seeking Men  🔝mahisagar🔝   Esc...➥🔝 7737669865 🔝▻ mahisagar Call-girls in Women Seeking Men  🔝mahisagar🔝   Esc...
➥🔝 7737669865 🔝▻ mahisagar Call-girls in Women Seeking Men 🔝mahisagar🔝 Esc...amitlee9823
 

Recently uploaded (20)

Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
 
VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...
VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...
VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...
 
(NEHA) Call Girls Katra Call Now 8617697112 Katra Escorts 24x7
(NEHA) Call Girls Katra Call Now 8617697112 Katra Escorts 24x7(NEHA) Call Girls Katra Call Now 8617697112 Katra Escorts 24x7
(NEHA) Call Girls Katra Call Now 8617697112 Katra Escorts 24x7
 
➥🔝 7737669865 🔝▻ Sambalpur Call-girls in Women Seeking Men 🔝Sambalpur🔝 Esc...
➥🔝 7737669865 🔝▻ Sambalpur Call-girls in Women Seeking Men  🔝Sambalpur🔝   Esc...➥🔝 7737669865 🔝▻ Sambalpur Call-girls in Women Seeking Men  🔝Sambalpur🔝   Esc...
➥🔝 7737669865 🔝▻ Sambalpur Call-girls in Women Seeking Men 🔝Sambalpur🔝 Esc...
 
Vip Mumbai Call Girls Thane West Call On 9920725232 With Body to body massage...
Vip Mumbai Call Girls Thane West Call On 9920725232 With Body to body massage...Vip Mumbai Call Girls Thane West Call On 9920725232 With Body to body massage...
Vip Mumbai Call Girls Thane West Call On 9920725232 With Body to body massage...
 
➥🔝 7737669865 🔝▻ Dindigul Call-girls in Women Seeking Men 🔝Dindigul🔝 Escor...
➥🔝 7737669865 🔝▻ Dindigul Call-girls in Women Seeking Men  🔝Dindigul🔝   Escor...➥🔝 7737669865 🔝▻ Dindigul Call-girls in Women Seeking Men  🔝Dindigul🔝   Escor...
➥🔝 7737669865 🔝▻ Dindigul Call-girls in Women Seeking Men 🔝Dindigul🔝 Escor...
 
Just Call Vip call girls Bellary Escorts ☎️9352988975 Two shot with one girl ...
Just Call Vip call girls Bellary Escorts ☎️9352988975 Two shot with one girl ...Just Call Vip call girls Bellary Escorts ☎️9352988975 Two shot with one girl ...
Just Call Vip call girls Bellary Escorts ☎️9352988975 Two shot with one girl ...
 
CHEAP Call Girls in Rabindra Nagar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Rabindra Nagar  (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICECHEAP Call Girls in Rabindra Nagar  (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Rabindra Nagar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
 
Discover Why Less is More in B2B Research
Discover Why Less is More in B2B ResearchDiscover Why Less is More in B2B Research
Discover Why Less is More in B2B Research
 
Just Call Vip call girls Palakkad Escorts ☎️9352988975 Two shot with one girl...
Just Call Vip call girls Palakkad Escorts ☎️9352988975 Two shot with one girl...Just Call Vip call girls Palakkad Escorts ☎️9352988975 Two shot with one girl...
Just Call Vip call girls Palakkad Escorts ☎️9352988975 Two shot with one girl...
 
Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...
Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...
Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...
 
Aspirational Block Program Block Syaldey District - Almora
Aspirational Block Program Block Syaldey District - AlmoraAspirational Block Program Block Syaldey District - Almora
Aspirational Block Program Block Syaldey District - Almora
 
➥🔝 7737669865 🔝▻ Mathura Call-girls in Women Seeking Men 🔝Mathura🔝 Escorts...
➥🔝 7737669865 🔝▻ Mathura Call-girls in Women Seeking Men  🔝Mathura🔝   Escorts...➥🔝 7737669865 🔝▻ Mathura Call-girls in Women Seeking Men  🔝Mathura🔝   Escorts...
➥🔝 7737669865 🔝▻ Mathura Call-girls in Women Seeking Men 🔝Mathura🔝 Escorts...
 
Just Call Vip call girls kakinada Escorts ☎️9352988975 Two shot with one girl...
Just Call Vip call girls kakinada Escorts ☎️9352988975 Two shot with one girl...Just Call Vip call girls kakinada Escorts ☎️9352988975 Two shot with one girl...
Just Call Vip call girls kakinada Escorts ☎️9352988975 Two shot with one girl...
 
➥🔝 7737669865 🔝▻ malwa Call-girls in Women Seeking Men 🔝malwa🔝 Escorts Ser...
➥🔝 7737669865 🔝▻ malwa Call-girls in Women Seeking Men  🔝malwa🔝   Escorts Ser...➥🔝 7737669865 🔝▻ malwa Call-girls in Women Seeking Men  🔝malwa🔝   Escorts Ser...
➥🔝 7737669865 🔝▻ malwa Call-girls in Women Seeking Men 🔝malwa🔝 Escorts Ser...
 
CHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICECHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
 
Call Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts Service
Call Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts ServiceCall Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts Service
Call Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts Service
 
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
 
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
 
➥🔝 7737669865 🔝▻ mahisagar Call-girls in Women Seeking Men 🔝mahisagar🔝 Esc...
➥🔝 7737669865 🔝▻ mahisagar Call-girls in Women Seeking Men  🔝mahisagar🔝   Esc...➥🔝 7737669865 🔝▻ mahisagar Call-girls in Women Seeking Men  🔝mahisagar🔝   Esc...
➥🔝 7737669865 🔝▻ mahisagar Call-girls in Women Seeking Men 🔝mahisagar🔝 Esc...
 

Stochastic gradient descent and its tuning

  • 1. Stochastic Gradient Descent algorithm & its tuning Mohamed, Qadri SUMMARY This paper talks about optimization algorithms used for big data applications. We start with explaining the gradient descent algorithms and its limitations. Later we delve into the stochastic gradient descent algorithms and explore methods to improve it it by adjusting learning rates. GRADIENT DESCENT Gradient descent is a first order optimization algorithm. To find a local minimum of a function using gradient descent, one takes steps proportional to the negative of the gradient (or of the approximate gradient) of the function at the current point. If instead one takes steps proportional to the positive of the gradient, one approaches a local maximum of that function; the procedure is then known as gradient ascent. Gradient descent is also known as steepest descent, or the method of steepest descent. Gradient descent should not be confused with the method of steepest descent for approximating integrals. Using the Gradient Decent (GD) optimization algorithm, the weights are updated incrementally after each epoch (= pass over the training dataset). The cost function J(⋅), the sum of squared errors (SSE), can be written as: The magnitude and direction of the weight update is computed by taking a step in the opposite direction of the cost gradient where η is the learning rate. The weights are then updated after each epoch via the following update rule: where Δw is a vector that contains the weight updates of each weight coefficient w, which are computed as follows: Essentially, we can picture GD optimization as a hiker (the weight coefficient) who wants to climb down a mountain (cost function) into a valley (cost minimum), and each step is determined by the steepness of the slope (gradient) and the leg length of the hiker (learning rate). Considering a cost function with only a single weight coefficient, we can illustrate this concept as follows: GRADIENT DESCENT VARIANTS There are three variants of gradient descent, which differ in how much data we use to compute the gradient of the objective function. Depending on the amount of data, we make a trade-off between the accuracy of the parameter update and the time it takes to perform an update.
  • 2. Batch Gradient Descent Vanilla gradient descent, aka batch gradient descent, computes the gradient of the cost function w.r.t. to the parameters θ for the entire training dataset: As we need to calculate the gradients for the whole dataset to perform just one update, batch gradient descent can be very slow and is intractable for datasets that don't fit in memory. Batch gradient descent also doesn't allow us to update our model online, i.e. with new examples on-the-fly. In code, batch gradient descent looks something like this: for i in range(nb_epochs): params_grad = evaluate_gradient(loss_function, data, params) params = params - learning_rate * params_grad For a pre-defined number of epochs, we first compute the gradient vector weights_grad of the loss function for the whole dataset w.r.t. our parameter vector params. We then update our parameters in the direction of the gradients with the learning rate determining how big of an update we perform. Batch gradient descent is guaranteed to converge to the global minimum for convex error surfaces and to a local minimum for non-convex surfaces. Stochastic Gradient Descent Stochastic gradient descent (SGD) in contrast performs a parameter update for each training example x(i)x(i)and label y(i)y(i): θ=θ−η⋅∇θJ(θ;x(i);y(i)) Batch gradient descent performs redundant computations for large datasets, as it recomputes gradients for similar examples before each parameter update. SGD does away with this redundancy by performing one update at a time. It is therefore usually much faster and can also be used to learn online. SGD performs frequent updates with a high variance that cause the objective function to fluctuate heavily as in Image. While batch gradient descent converges to the minimum of the basin the parameters are placed in, SGD's fluctuation, on the one hand, enables it to jump to new and potentially better local minima. On the other hand, this ultimately complicates convergence to the exact minimum, as SGD will keep overshooting. However, it has been shown that when we slowly decrease the learning rate, SGD shows the same convergence behavior as batch gradient descent, almost certainly converging to a local or the global minimum for non-convex and convex optimization respectively. Its code fragment simply adds a loop over the training examples and evaluates the gradient w.r.t. each example. for i in range(nb_epochs): np.random.shuffle(data) for example in data: params_grad = evaluate_gradient(loss_function, example, params) params = params - learning_rate * params_grad In both gradient descent (GD) and stochastic gradient descent (SGD), you update a set of parameters in an iterative manner to minimize an error function. While in GD, you have to run through ALL the samples in your training set to do a single update for a parameter in a particular iteration, in SGD, on the other hand, you use ONLY ONE training sample from your training set to do the update for a parameter in a particular iteration. Thus, if the number of training samples are large, in fact very large, then using gradient descent may take too long because in every iteration when you are updating the values of the parameters, you are running through the complete training set. On the
In both gradient descent (GD) and stochastic gradient descent (SGD), you update a set of parameters in an iterative manner to minimize an error function. In GD, you have to run through ALL the samples in your training set to perform a single parameter update in a particular iteration; in SGD, on the other hand, you use ONLY ONE training sample from your training set to perform that update. Thus, if the number of training samples is very large, gradient descent may take too long, because every parameter update requires running through the complete training set. Using SGD, on the other hand, will be faster, because it uses only one training sample and starts improving right away from the first sample. SGD often converges much faster than GD, but the error function is not as well minimized as in the case of GD. In most cases, however, the close approximation that SGD gives for the parameter values is enough, because the parameters reach values close to the optimum and keep oscillating there.

There are several different flavors of SGD, which can all be found throughout the literature. Let's take a look at the three most common variants:

A)
• randomly shuffle samples in the training set
• for one or more epochs, or until the approximate cost minimum is reached:
  • for each training sample i:
    • compute gradients and perform weight updates

B)
• for one or more epochs, or until the approximate cost minimum is reached:
  • randomly shuffle samples in the training set
  • for each training sample i:
    • compute gradients and perform weight updates

C)
• for iterations t, or until the approximate cost minimum is reached:
  • draw a random sample from the training set
  • compute gradients and perform weight updates

In scenario A, we shuffle the training set only once at the beginning, whereas in scenario B we shuffle the training set after each epoch to prevent repeating update cycles. In both scenario A and scenario B, each training sample is used only once per epoch to update the model weights. In scenario C, we draw the training samples randomly with replacement from the training set. If the number of iterations t is equal to the number of training samples, we learn the model based on a bootstrap sample of the training set.

In the gradient descent method, one computes the direction that decreases the objective function the most (for minimization problems), but this can be quite costly. In most machine learning settings, for example, the objective function is the cumulative sum of the error over the training examples, and the training set may be very large, so computing the actual gradient is computationally expensive. In the stochastic gradient descent method, we instead compute an estimate or approximation of this direction. The simplest way is to look at a single training example (or a subset of training examples) and compute the direction to move based on this approximation alone. It is called stochastic because the approximate direction computed at every step can be thought of as a random variable of a stochastic process; this interpretation is mainly used in proving the convergence of the algorithm. Recent theoretical results, moreover, show that the runtime needed to reach some desired optimization accuracy does not increase as the training set size increases.

Stochastic gradient descent is sensitive to feature scaling, so it is highly recommended to scale your data. For example, scale each attribute of the input vector X to [0, 1] or [-1, +1], or standardize it to have mean 0 and variance 1. Note that the same scaling must be applied to the test vector to obtain meaningful results. Empirically, SGD has been found to converge after observing approximately 10^6 training samples. Thus, a reasonable first guess for the number of iterations is n_iter = np.ceil(10**6 / n), where n is the size of the training set. If you apply SGD to features extracted using PCA, it is often wise to scale the feature values by some constant c such that the average L2 norm of the training data equals one.
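As an illustration of these two recommendations (feature scaling and the 10^6-sample heuristic), here is a minimal scikit-learn sketch. The toy data, the hinge loss, and the pipeline are illustrative choices, and the name of the iteration parameter (n_iter in older releases, max_iter in current ones) depends on the scikit-learn version.

import numpy as np
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# toy data, for illustration only
X = np.random.randn(1000, 20)
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)

n = X.shape[0]
n_iter = int(np.ceil(10**6 / n))           # heuristic: observe roughly 10^6 samples in total

clf = make_pipeline(
    StandardScaler(),                       # standardize features to mean 0, variance 1
    SGDClassifier(loss="hinge", max_iter=n_iter, tol=None),
)
clf.fit(X, y)                               # the pipeline applies the same scaling at predict time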
We found that averaged SGD works best with a larger number of features and a higher eta0 (the initial learning rate).
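As a small, hedged illustration, averaged SGD can be requested in scikit-learn through the average parameter; the constant learning-rate schedule and the eta0 value below are illustrative choices, and X, y are the toy data from the previous sketch (scaling omitted here for brevity).

from sklearn.linear_model import SGDClassifier

asgd = SGDClassifier(average=True, learning_rate="constant", eta0=0.1)  # averaged SGD with a higher initial rate
asgd.fit(X, y)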
Here, the term "stochastic" comes from the fact that the gradient based on a single training sample is a "stochastic approximation" of the "true" cost gradient. Due to its stochastic nature, the path towards the global cost minimum is not "direct" as in GD, but may go "zig-zag" if we visualize the cost surface in a 2D space. However, it has been shown that SGD almost surely converges to the global cost minimum if the cost function is convex (or pseudo-convex).

There may be many reasons, but one reason why SGD is preferred in machine learning is that it helps the algorithm skip some local minima. Although this is not a theoretically rigorous argument, the optima computed using SGD are often empirically better than those found by GD. SGD is just one type of online learning algorithm; there are many other online learning algorithms that do not depend on gradients at all (e.g. the perceptron algorithm, Bayesian inference, etc.).

Mini-Batch Gradient Descent

Mini-batch gradient descent finally takes the best of both worlds and performs an update for every mini-batch of n training examples:

θ = θ − η ⋅ ∇θ J(θ; x(i:i+n); y(i:i+n))

This way, it a) reduces the variance of the parameter updates, which can lead to more stable convergence; and b) can make use of highly optimized matrix operations common to state-of-the-art deep learning libraries, which make computing the gradient w.r.t. a mini-batch very efficient. Common mini-batch sizes range between 50 and 256, but can vary for different applications. Mini-batch gradient descent is typically the algorithm of choice when training a neural network, and the term SGD is usually employed even when mini-batches are used (a minimal mini-batch loop is sketched below, after the list of challenges). Note: in the modifications of SGD discussed in the rest of this paper, we leave out the parameters x(i:i+n); y(i:i+n) for simplicity.

Challenges

Vanilla mini-batch gradient descent, however, does not guarantee good convergence and poses a few challenges that need to be addressed:

• Choosing a proper learning rate can be difficult. A learning rate that is too small leads to painfully slow convergence, while a learning rate that is too large can hinder convergence and cause the loss function to fluctuate around the minimum or even to diverge.
• Learning rate schedules try to adjust the learning rate during training, e.g. by annealing, i.e. reducing the learning rate according to a pre-defined schedule or when the change in objective between epochs falls below a threshold. These schedules and thresholds, however, have to be defined in advance and are thus unable to adapt to a dataset's characteristics.
• Additionally, the same learning rate applies to all parameter updates. If our data is sparse and our features have very different frequencies, we might not want to update all of them to the same extent, but rather perform larger updates for rarely occurring features.
• Another key challenge of minimizing the highly non-convex error functions common for neural networks is avoiding getting trapped in their numerous suboptimal local minima. Some scientists argue that the difficulty arises in fact not from local minima but from saddle points, i.e. points where one dimension slopes up and another slopes down. These saddle points are usually surrounded by a plateau of the same error, which makes it notoriously hard for SGD to escape, as the gradient is close to zero in all dimensions.

There are some algorithms that are widely used by the deep learning community to deal with the aforementioned challenges. Below are some of them.
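Before turning to those algorithms, here is the mini-batch loop referred to above, written as a minimal sketch in the style of the earlier snippets; get_batches is a hypothetical helper (not defined in the paper) that yields mini-batches of, say, 50 examples.

for i in range(nb_epochs):
    np.random.shuffle(data)
    for batch in get_batches(data, batch_size=50):       # hypothetical helper yielding mini-batches
        params_grad = evaluate_gradient(loss_function, batch, params)
        params = params - learning_rate * params_grad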
Momentum

SGD has trouble navigating ravines, i.e. areas where the surface curves much more steeply in one dimension than in another, which are common around local optima. In these scenarios, SGD oscillates across the slopes of the ravine while only making hesitant progress along the bottom towards the local optimum, as in Image 2.
Image 2: SGD without momentum. Image 3: SGD with momentum.

Momentum is a method that helps accelerate SGD in the relevant direction and dampens oscillations, as can be seen in Image 3. It does this by adding a fraction γ of the update vector of the past time step to the current update vector:

vt = γ vt−1 + η ∇θ J(θ)
θ = θ − vt

Essentially, when using momentum, we push a ball down a hill. The ball accumulates momentum as it rolls downhill, becoming faster and faster on the way (until it reaches its terminal velocity if there is air resistance, i.e. γ < 1). The same thing happens to our parameter updates: the momentum term increases for dimensions whose gradients point in the same direction and reduces updates for dimensions whose gradients change direction. As a result, we gain faster convergence and reduced oscillation.

Nesterov Accelerated Gradient

However, a ball that rolls down a hill, blindly following the slope, is highly unsatisfactory. We'd like to have a smarter ball, a ball that has a notion of where it is going, so that it knows to slow down before the hill slopes up again. Nesterov accelerated gradient (NAG) is a way to give our momentum term this kind of prescience. We know that we will use our momentum term γvt−1 to move the parameters θ. Computing θ − γvt−1 thus gives us an approximation of the next position of the parameters (the gradient is missing for the full update), a rough idea of where our parameters are going to be. We can now effectively look ahead by calculating the gradient not w.r.t. our current parameters θ but w.r.t. the approximate future position of our parameters:

vt = γ vt−1 + η ∇θ J(θ − γ vt−1)
θ = θ − vt

Again, we set the momentum term γ to a value of around 0.9. While momentum first computes the current gradient (small blue vector in Image 4) and then takes a big jump in the direction of the updated accumulated gradient (big blue vector), NAG first makes a big jump in the direction of the previous accumulated gradient (brown vector), measures the gradient and then makes a correction (green vector). This anticipatory update prevents us from going too fast and results in increased responsiveness, which has significantly increased the performance of RNNs on a number of tasks.

Image 4

Adagrad

Adagrad is an algorithm for gradient-based optimization that does just this: it adapts the learning rate to the parameters, performing larger updates for infrequent and smaller updates for frequent parameters. For this reason, it is well suited for dealing with sparse data. Adagrad greatly improved the robustness of SGD, and it was used for training large-scale neural nets at Google which, among other things, learned to recognize cats in YouTube videos.

Previously, we performed an update for all parameters θ at once, as every parameter θi used the same learning rate η. As Adagrad uses a different learning rate for every parameter θi at every time step t, we first show Adagrad's per-parameter update; the vectorized form follows the same pattern. For brevity, we set gt,i to be the gradient of the objective function w.r.t. the parameter θi at time step t:

gt,i = ∇θ J(θt,i)
θt+1,i = θt,i − η / √(Gt,ii + ε) ⋅ gt,i

where Gt,ii is the sum of the squares of the gradients w.r.t. θi up to time step t, and ε is a small smoothing term that avoids division by zero. One of Adagrad's main benefits is that it eliminates the need to manually tune the learning rate. Most implementations use a default value of 0.01 and leave it at that.
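To make the momentum and Adagrad update rules concrete, here is a minimal NumPy sketch (not from the paper) on a toy quadratic objective; grad(), params, num_steps, and the hyperparameter values are all illustrative placeholders.

import numpy as np

# toy objective J(params) = 0.5 * ||params||^2, so its gradient is simply params;
# this stands in for the gradient of a real cost function
def grad(params):
    return params

params = np.array([1.0, -2.0, 3.0])
num_steps = 100
gamma, eta, eps = 0.9, 0.01, 1e-8

# classical momentum
v = np.zeros_like(params)                    # momentum "velocity"
for t in range(num_steps):
    g = grad(params)
    v = gamma * v + eta * g                  # v_t = gamma * v_{t-1} + eta * gradient
    params = params - v

# Adagrad (shown independently of the loop above)
G = np.zeros_like(params)                    # accumulated squared gradients, per parameter
for t in range(num_steps):
    g = grad(params)
    G = G + g ** 2                           # every added term is positive, so G keeps growing
    params = params - eta / np.sqrt(G + eps) * g   # larger steps for rarely updated parameters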
Adagrad's main weakness is its accumulation of the squared gradients in the denominator: since every added term is positive, the accumulated sum keeps growing during training. This in turn causes the learning rate to shrink and eventually become infinitesimally small, at which point the algorithm is no longer able to acquire additional knowledge. The following algorithms aim to resolve this flaw.

Adadelta

Adadelta is an extension of Adagrad that seeks to reduce its aggressive, monotonically decreasing learning rate. Instead of accumulating all past squared gradients, Adadelta restricts the window of accumulated past gradients to some fixed size w. Rather than inefficiently storing w previous squared gradients, the sum of gradients is recursively defined as a decaying average of all past squared gradients. The running average E[g²]t at time step t then depends (via a fraction γ, similarly to the momentum term) only on the previous average and the current gradient:

E[g²]t = γ E[g²]t−1 + (1 − γ) g²t

RMSprop

RMSprop is an unpublished, adaptive learning rate method proposed by Geoff Hinton. RMSprop and Adadelta were developed independently around the same time, stemming from the need to resolve Adagrad's radically diminishing learning rates. RMSprop is in fact identical to the first update vector of Adadelta derived above:

E[g²]t = 0.9 E[g²]t−1 + 0.1 g²t
θt+1 = θt − η / √(E[g²]t + ε) ⋅ gt

RMSprop likewise divides the learning rate by an exponentially decaying average of squared gradients. Hinton suggests setting γ to 0.9, while a good default value for the learning rate η is 0.001.

Adam

Adaptive Moment Estimation (Adam) is another method that computes adaptive learning rates for each parameter. In addition to storing an exponentially decaying average of past squared gradients vt like Adadelta and RMSprop, Adam also keeps an exponentially decaying average of past gradients mt, similar to momentum:

mt = β1 mt−1 + (1 − β1) gt
vt = β2 vt−1 + (1 − β2) g²t

mt and vt are estimates of the first moment (the mean) and the second moment (the uncentered variance) of the gradients respectively, hence the name of the method. As mt and vt are initialized as vectors of 0's, the authors of Adam observe that they are biased towards zero, especially during the initial time steps, and especially when the decay rates are small (i.e. β1 and β2 are close to 1). They counteract these biases by computing bias-corrected estimates of mt and vt, which are then used in the parameter update.
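The following is a minimal NumPy sketch (not from the paper) of the Adam update with bias correction; it reuses the same toy grad(), params, and num_steps placeholders as the previous sketch, and the hyperparameter values are the commonly used defaults (β1 = 0.9, β2 = 0.999, η = 0.001).

import numpy as np

def grad(params):                      # toy stand-in for the gradient of a real cost function
    return params

params = np.array([1.0, -2.0, 3.0])
num_steps = 100
beta1, beta2, eta, eps = 0.9, 0.999, 0.001, 1e-8

m = np.zeros_like(params)              # decaying average of past gradients (first moment)
v = np.zeros_like(params)              # decaying average of past squared gradients (second moment)

for t in range(1, num_steps + 1):
    g = grad(params)
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g ** 2
    m_hat = m / (1 - beta1 ** t)       # bias correction for the zero initialization
    v_hat = v / (1 - beta2 ** t)
    params = params - eta * m_hat / (np.sqrt(v_hat) + eps)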
ADDITIONAL STRATEGIES FOR OPTIMIZING SGD

Finally, we introduce additional strategies that can be used alongside any of the previously mentioned algorithms to further improve the performance of SGD.

Shuffling and Curriculum Learning

Generally, we want to avoid providing the training examples to our model in a meaningful order, as this may bias the optimization algorithm. Consequently, it is often a good idea to shuffle the training data after every epoch. On the other hand, for cases where we aim to solve progressively harder problems, supplying the training examples in a meaningful order may actually lead to improved performance and better convergence. The method for establishing this meaningful order is called curriculum learning.

Batch Normalization

To facilitate learning, we typically normalize the initial values of our parameters by initializing them with zero mean and unit variance. As training progresses and we update parameters to different extents, we lose this normalization, which slows down training and amplifies changes as the network becomes deeper. Batch normalization re-establishes these normalizations for every mini-batch, and changes are back-propagated through the operation as well. By making normalization part of the model architecture, we are able to use higher learning rates and pay less attention to the initialization parameters. Batch normalization additionally acts as a regularizer, reducing (and sometimes even eliminating) the need for dropout.

Early Stopping

You should always monitor the error on a validation set during training and stop (with some patience) if the validation error does not improve enough.

Gradient Noise

Adding noise to each gradient update makes networks more robust to poor initialization and helps in training particularly deep and complex networks. It is suspected that the added noise gives the model more chances to escape and find new local minima, which are more frequent for deeper models.

We have now reviewed the algorithms that are most commonly used for optimizing SGD: momentum, Nesterov accelerated gradient, Adagrad, Adadelta, RMSprop, and Adam. We have also considered other strategies to improve SGD, such as shuffling and curriculum learning, batch normalization, and early stopping.

SGD has been successfully applied to large-scale and sparse machine learning problems often encountered in text classification and natural language processing. Given that the data is sparse, the classifiers in scikit-learn's SGD module easily scale to problems with more than 10^5 training examples and more than 10^5 features.

Advantages of Stochastic Gradient Descent
• Efficiency.
• Ease of implementation (lots of opportunities for code tuning).

Disadvantages of Stochastic Gradient Descent
• SGD requires a number of hyperparameters, such as the regularization parameter and the number of iterations.
• SGD is sensitive to feature scaling.

The class SGDClassifier in scikit-learn implements a plain stochastic gradient descent learning routine that supports different loss functions and penalties for classification. The class SGDRegressor implements a plain stochastic gradient descent learning routine that supports different loss functions and penalties to fit linear regression models. SGDRegressor is well suited for regression problems with a large number of training samples (> 10,000); for other problems we recommend Ridge, Lasso, or ElasticNet. The major advantage of SGD is its efficiency, which is basically linear in the number of training examples. If X is a matrix of size (n, p), training has a cost of O(k n p̄), where k is the number of iterations (epochs) and p̄ is the average number of non-zero attributes per sample.

Application

SGD algorithms can therefore be used in scenarios where it is expensive to iterate over the entire dataset several times. This makes the algorithm ideal for streaming analytics, where we can discard the data after processing it. We can also use modified versions of SGD, such as mini-batch gradient descent with adjusted learning rates, to get better results, as seen above. The algorithm can serve as the optimization routine for several classification and regression techniques.

In a real-time scenario, SGD learns continuously as the data comes in, somewhat like reinforcement learning with feedback. An example is a real-time system in which a customer is judged, on the basis of some parameters, to be genuine or not. Based on historical data, the first set of parameters is learnt. As a new customer comes in with a new set of features, it falls into one of the two groups: genuine or not genuine. Based on this completed classification, the prior parameters of the system are updated using the features that the customer brought in.
Over time, the system becomes much smarter than what it started with. Apple's Siri and Microsoft's Cortana assistants are also examples of real-time learning systems that correct themselves on the go.
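A minimal sketch of this kind of continuous learning, using scikit-learn's partial_fit interface; stream_of_batches() below is a toy generator standing in for a real, already-scaled data stream, and the two classes mimic the genuine / not genuine example.

import numpy as np
from sklearn.linear_model import SGDClassifier

def stream_of_batches(n_batches=10, batch_size=32, n_features=5):
    # toy stand-in for incoming customer data; in practice this would read from a real stream
    for _ in range(n_batches):
        X_batch = np.random.randn(batch_size, n_features)
        y_batch = (X_batch[:, 0] > 0).astype(int)
        yield X_batch, y_batch

clf = SGDClassifier()
classes = np.array([0, 1])                                # e.g. not genuine / genuine

for X_batch, y_batch in stream_of_batches():
    clf.partial_fit(X_batch, y_batch, classes=classes)    # update the model incrementally as data arrives
    # clf.predict(...) can be called at any point to classify new customers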
Conclusion

In conclusion, we saw that to manage large-scale data we can use variants of gradient descent: batch gradient descent, stochastic gradient descent (SGD), and mini-batch gradient descent, each an improvement over the previous one. They differ in how much data we use to compute the gradient of the objective function. Given the ubiquity of large-scale data solutions and the availability of low-cost commodity clusters, distributing SGD to speed it up further is an obvious choice. SGD by itself is inherently sequential: step by step, we progress further towards the minimum. Running it this way provides good convergence but can be slow, particularly on large datasets. In contrast, running SGD asynchronously is faster, but suboptimal communication between workers can lead to poor convergence. Additionally, we can also parallelize SGD on one machine without the need for a large computing cluster.

Depending on the amount of data, we make a trade-off between the accuracy of the parameter update and the time it takes to perform an update. To address the challenge of choosing and adapting the learning rate, we looked at various methods such as momentum, Nesterov accelerated gradient, Adagrad, Adadelta, RMSprop, and Adam. In summary, RMSprop is an extension of Adagrad that deals with its radically diminishing learning rates. It is identical to Adadelta, except that Adadelta uses the RMS of parameter updates in the numerator of its update rule. Adam, finally, adds bias-correction and momentum to RMSprop. RMSprop, Adadelta, and Adam are thus very similar algorithms that do well in similar circumstances. Adam's bias-correction helps it slightly outperform RMSprop towards the end of optimization as gradients become sparser, so Adam might be the best overall choice.

Interestingly, many recent papers use vanilla SGD without momentum and a simple learning-rate annealing schedule. As has been shown, SGD usually manages to find a minimum, but it might take significantly longer than the adaptive optimizers, relies much more on a robust initialization and annealing schedule, and may get stuck in saddle points rather than local minima. Consequently, if you care about fast convergence and train a deep or complex neural network, you should choose one of the adaptive learning rate methods.